Abstract

Reversible circuits for modular multiplication Cx%M with x < M arise as components of modular 
exponentiation in Shor's quantum number-factoring algorithm. However, existing generic construc- 
tions focus on asymptotic gate count and circuit depth rather than actual values, producing fairly 
large circuits not optimized for specific C and M values. In this work, we develop such optimizations 
in a bottom-up fashion, starting with most convenient C values. When zero-initialized ancilla regis- 
ters are available, we reduce the search for compact circuits to a shortest-path problem. Some of our 
modular-multiplication circuits are asymptotically smaller than previous constructions, but worst-case
bounds and average sizes remain Θ(n^2). In the context of modular exponentiation, we offer
several constant-factor improvements, as well as an improvement by a constant additive term that 
is significant for few-qubit circuits arising in ongoing laboratory experiments with Shor's algorithm. 

1 Introduction 

The pursuit of quantum computation [15] has generated both excitement and controversy, while producing
few compelling empirical demonstrations so far. Adiabatic computing experiments by DWave
Systems were sharply criticized for not demonstrating quantum entanglement and not solving hard prob- 
lem instances that would confound best known problem-specific algorithms on non-quantum computers. 
Several academic groups implemented Shor's number-factoring algorithm on several qubits to factor the 
number 15, recalling that asymptotic worst-case complexity of Shor's algorithm is polynomial while best 
known number-factoring algorithms for non-quantum computers take more than polynomial time to run, 
both in theory and in practice. Experiments with photonic quantum gates [12], [11], [16] suggest the presence
of entanglement¹ but leave unclear how entanglement is going to scale in larger systems. Recent ion
traps decrease per-gate error rates below the threshold estimate for fault-tolerant quantum computing 
[7] , making sophisticated quantum algorithms more practical if appropriate quantum error-correction is 
used. Shor's algorithm remains the best candidate for benchmarking quantum algorithms because (i) it
solves a practical problem for which optimized non-quantum software is also available, (ii) it has been
thoroughly studied, and (iii) it can be implemented with several known circuits.

Reducing the size of quantum circuits required by Shor's algorithm [5], [20], the focus of our work,
decreases resource requirements for future quantum computers in a non-linear way because larger circuits
entail heavier overhead for quantum error-correction [21]. In comparisons to non-quantum
number-factoring software, smaller circuits can make quantum computers more competitive. However, quantum
simulators [23] can also run faster on smaller circuits. The significance of simulation in benchmarking
quantum algorithms is twofold. First, simulation can help studying intermediate states generated by a 
quantum algorithm and estimate the amount of quantum entanglement in these states. Second, simulators 
can be viewed as competing non-quantum algorithms. While this aspect of simulation is often dismissed 
a priori, an instructive example is given by the Quantum Fourier Transform (QFT). It was recently
discovered that QFT can be efficiently simulated when used stand-alone [1], [24] (but not as part of
Shor's algorithm) and thus does not offer a quantum speed-up, despite generating a significant amount
of entanglement. This unexpected result was obtained independently by several researchers [1], [21] by
optimizing approximate QFT circuits for a specific simulation technique [13].

¹Similar results were shown by simulation for semiconductor nanostructures [8].

Figure 1: An outline of the quantum part of Shor's algorithm.

1.1 Shor's algorithm 

Shor's algorithm seeks to factor a given value M > 0, which we assume to be semiprime M = pq with
unknown factors. The strategy is to consider the functions f_b(x) = b^x % M, potentially with several
different 1 < b < M values, and determine their periods in case gcd(b, M) = 1. When the period is
determined to be even, b^{2r} % M = 1, we have (b^r - 1)(b^r + 1) % M = 0, thus either (b^r - 1) or (b^r + 1)
must share at least one prime factor with M. If b^r % M ≠ -1, such a factor can be found using
gcd(b^r ± 1, M), otherwise it leads to the trivial factors 1 and M. When the period is determined to be
odd, another b value is tried.
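
The classical post-processing just described can be sketched in Python. This is only an illustration under our own assumptions: the brute-force `multiplicative_order` helper stands in for the quantum period-finding step, and both function names are ours rather than part of any library.

```python
from math import gcd

def multiplicative_order(b, M):
    """Brute-force period of f(x) = b^x % M; requires gcd(b, M) = 1."""
    period, value = 1, b % M
    while value != 1:
        value = (value * b) % M
        period += 1
    return period

def factor_from_base(b, M):
    """Return a nontrivial factor of M obtained from base b, or None if b fails."""
    shared = gcd(b, M)
    if shared != 1:
        return shared               # b already shares a factor with M
    period = multiplicative_order(b, M)
    if period % 2 == 1:
        return None                 # odd period: try another b
    half = pow(b, period // 2, M)   # b^r % M where the period is 2r
    if half == M - 1:
        return None                 # b^r = -1 (mod M): only trivial factors
    return gcd(half - 1, M)
```

For example, with M = 15 the base b = 7 has period 4 and yields the factor 3, whereas b = 14 has period 2 with 14 % 15 = -1 % 15 and must be discarded.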

The period-finding procedure relies on a quantum circuit (Figure 1), instantiated for a given value
1 < b < M coprime with M. The circuit operates on two 0-initialized quantum registers [15] with

• a block of parallel Hadamard gates on Register 1, 

• a circuit for modular exponentiation (mod-exp) that evaluates f(y) = b^y % M by mapping
|y⟩|0⟩ ↦ |y⟩|f(y)⟩, where y is read from Register 1 and f(y) is written to Register 2; Register 1 can be
temporarily modified, but must be restored at the end,

• a circuit for the Quantum Fourier Transform (QFT) on Register 1, 

• a block of parallel measurements on Register 1. 

The first and last blocks cannot be optimized any further. QFT circuits are understood fairly well
and are much smaller than circuits for modular exponentiation [15]. Therefore, our focus is on mod-exp
circuits. They typically consist of reversible gates, namely NOT (N), CNOT (C) and Toffoli (T), which can
be modeled and optimized entirely in terms of Boolean logic [17]. However, in physical implementations,
Toffoli gates must be decomposed into smaller gates directly implementable in a given technology [18].
Reversible circuits for modular exponentiation start with an inverter on Register 2 that changes the
|000···0⟩ value to |000···1⟩, and otherwise exhibit the following structure: each (i-th) bit of Register
1 enables (controls) a circuit block that multiplies Register 2 by C_i = b^{2^i} % M and reduces the result
%M. When b and M are known, C_i can be pre-computed without quantum computation. Therefore,
we refer to C_i x % M blocks below. They are typically implemented using shift and addition circuits, and
a number of relevant quantum adders are known [5], [19]. The selection of appropriate adder types is
discussed in [20], [10].

Each controlled modular multiplication is traditionally implemented separately. When dealing with
reversible logic and quantum circuits, we note that the coprimality of C and M makes x ↦ Cx%M a
reversible transformation. The number of coprime C values is φ(M) = (p - 1)(q - 1), where φ(M) is
Euler's totient function and gives the size of (Z/MZ)^×, the multiplicative group of integers mod-M.
For M = 15, modular multiplication circuits for the eight coprime C values are illustrated in Figure 2.
Figure 3 shows circuits for f(x) = b^x % 15, gcd(b, 15) = 1.
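
The count of usable constants and the reversibility claim are easy to check numerically. The following Python sketch (function names are ours, chosen for illustration) enumerates the C values coprime with M and tests whether x ↦ Cx%M permutes {0, ..., M - 1}.

```python
from math import gcd

def coprime_multipliers(M):
    """All constants 1 <= C < M with gcd(C, M) = 1."""
    return [C for C in range(1, M) if gcd(C, M) == 1]

def is_permutation(C, M):
    """True when x -> C*x % M is a bijection on {0, ..., M-1}."""
    return sorted(C * x % M for x in range(M)) == list(range(M))

multipliers = coprime_multipliers(15)   # [1, 2, 4, 7, 8, 11, 13, 14]
```

For M = 15 = 3 · 5 this lists eight constants, matching (p - 1)(q - 1) = 2 · 4, while a non-coprime constant such as C = 3 does not give a reversible map.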

When not knowing p and q, one should also not assume any knowledge that would make it easy to
find them. For example, one should not choose C that satisfies C^{2r} = 1 % M with a known (small) r
because such solutions would allow one to factorize M via gcd(C^r ± 1, M). Also recall that (Z/MZ)^× is
a product of two cyclic groups (Z/pZ)^× and (Z/qZ)^×, and thus (Z/MZ)^× admits a generating set with only
two elements. However, knowing such generators is tantamount to knowing p and q. When working
with specific small M = pq, it is sometimes difficult to avoid using the knowledge of p and q, but results
obtained this way do not necessarily scale to large values. The same can be said about results produced
through exhaustive search.



²Here and in the remaining text, the percent sign % denotes the modulo (remainder) operation, as it does in the C and
C++ languages.

1.2 Known circuits for modular multiplication by a constant



We now outline several approaches to modular multiplication by a constant and point out their potential 
inefficiencies. It is commonly agreed that techniques that give asymptotically the smallest gate counts
(based on Fast Fourier Transforms) are not practical for up to several hundred bits, and we do not discuss 
them here. Karatsuba multiplication also does not appear competitive, as far as we can tell. 

Multiplication by a known constant C (not modular) can be implemented as a sequence of alternating
shifts and additions, where one of the addends is always x [6]. When C is even, we factor out a
power of two and accumulate a multiplication by a power of two, leaving a smaller odd constant C'. For
an odd constant C > 1, we subtract one and accumulate a +x operator, leaving a smaller even constant
C''. This process stops at 1 and essentially traverses the binary expansion of C from the least significant
bit to the most significant bit, resulting in n_1 - 1 additions when the binary expansion of C includes n_1
non-zero bits. On average, an n-bit number has n/2 non-zero bits. An improvement is possible by also
using -x operators. When C is odd, we consider C%4. When C%4 = 1, we proceed as above. When
C%4 = 3, we add one and accumulate a -x operator. This step may temporarily increase an odd constant
by one, but always results in constants divisible by four, so the next step will decrease it by more.
This algorithm essentially constructs the so-called Canonical Signed Digit (CSD) representation [31 E]
that prohibits adjacent non-zero bits. Thus, the number of additions and subtractions cannot exceed n/2
and averages n/3. For example, consider 39 = 0b100111. Rather than expand 39x = 2(2(8x + x) + x) + x
with three additions, we can expand 39x = 8(4x + x) - x with only two addition/subtraction operations.
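
The recoding procedure above can be sketched in a few lines of Python. The function names are ours; the sketch follows the C%4 rule from the text and returns signed digits from the least significant position upward.

```python
def csd_digits(c):
    """Signed digits (-1, 0, +1) of c, least significant first, via the C%4 rule."""
    digits = []
    while c != 0:
        if c % 2 == 0:
            digits.append(0)
        elif c % 4 == 1:
            digits.append(1)    # accumulate a +x operator
            c -= 1
        else:
            digits.append(-1)   # c % 4 == 3: accumulate a -x operator
            c += 1
        c //= 2
    return digits

def csd_operation_count(c):
    """Number of additions/subtractions in the shift-and-add expansion of c*x."""
    return sum(1 for d in csd_digits(c) if d != 0) - 1
```

For C = 39 the digits are [-1, 0, 0, 1, 0, 1], i.e. 39 = 32 + 8 - 1, giving two operations instead of the three additions required by the plain binary expansion.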

Computing Cx%M by reversible circuits through binary or CSD expansion of C poses several
challenges. This technique is based on the operation 2^k x + x and thus requires a modular adder circuit
with two (unknown) arguments, and must also copy the x value to a separate register. A single (ancilla)
register suffices, but clearing it (reinitializing to 0) after the additions requires effort. As we explain
below, clearing the ancillae requires another modular-multiplication circuit. For constants C with sparse
CSD expansion, the ancillae-clearing circuit can be much larger than the multiplication itself, as its CSD
expansion is unlikely to be sparse. In general, the second circuit requires on the order of n^2 gates for
n-bit arguments, and is the same size as the first circuit, on average.

Computing Cx%M using binary expansion of x, rather than C,³ entails chaining x_i-controlled
mod-M additions of constants (2^i C)%M. As shown in Section 2, such reversible modular
addition-of-a-constant circuits can be simplified for each particular constant and require a single (register) argument
rather than two (cf. previous paragraph). Even for the simplest constants, quadratically many gates
are required. Moreover, x cannot be modified while modular additions are controlled by the bits of x.
Therefore, the additions must be accumulated in a separate register, which again requires clearing the
garbage ancillae.
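
A classical model of this second approach is given below as a minimal Python sketch. The function name and the explicit bit-width parameter n are our own choices; each loop iteration plays the role of one x_i-controlled modular addition of the precomputed constant (2^i C)%M into a separate accumulator register.

```python
def modmul_by_controlled_adds(C, x, M, n):
    """Compute C*x % M by accumulating (2^i * C) % M for every set bit x_i of x."""
    acc = 0                                  # separate accumulator register
    for i in range(n):
        constant = ((1 << i) * C) % M        # precomputed classically
        if (x >> i) & 1:                     # bit x_i acts as the control
            acc = (acc + constant) % M       # modular addition of a constant
    return acc
```

Note that x is only read, never modified, which is why the result ends up in a second register and the original x remains to be cleared.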

Clearing garbage ancillae. In reversible circuits, 0-initialized ancillae must be cleared by each circuit
block (except, possibly, the last), but some blocks produce garbage bits. For example, using traditional
implementations of constant-multiplication as a sequence of shifts and adds requires creating a copy

³Given that x is not a constant, its CSD expansion is not easily available and cannot be used in combinational
multiplication circuits.



Figure 2: Circuits for f(x) = Cx%M, M = 15, gcd(C, M) = 1, with inputs x_0 ... x_3 and outputs y_0 ... y_3:
(a) C = 2^k (panels C = 1, 2, 4, 8), (b) C = M - 2^k (panels C = 14, 13, 11, 7). Single-bit gates
indicate inverters, while the two-bit gates are bit-swaps. All gates are linear and can thus be simulated
on an initial full-superposition state using the stabilizer formalism [15].

Figure 3: Circuits for f(x) = b^x % 15, gcd(b, 15) = 1. These circuits compare favorably with better-known
circuits in terms of controlled-SWAP gates, because each controlled-SWAP is worth three Toffoli gates.
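
The reason why inverters and bit-swaps suffice for M = 15 can be checked numerically. In the Python sketch below (helper names are ours), multiplication by 2^k modulo 15 is a cyclic rotation of the four bits of x, and multiplication by 15 - 2^k is the same rotation followed by inverting every bit.

```python
def rotl4(x, k):
    """Rotate the 4-bit value x left by k positions (a chain of bit-swaps)."""
    k %= 4
    return ((x << k) | (x >> (4 - k))) & 0xF

def invert4(x):
    """Apply an inverter (NOT) to each of the 4 bits."""
    return x ^ 0xF

# C = 2^k: rotation alone; C = 15 - 2^k: rotation followed by inverters.
rotation_ok = all(rotl4(x, k) == (2**k * x) % 15
                  for x in range(15) for k in range(4))
negation_ok = all(invert4(rotl4(x, k)) == ((15 - 2**k) * x) % 15
                  for x in range(1, 15) for k in range(4))
```

The second identity is checked for 0 < x < 15 only, because inverting the all-zero word gives 15 rather than 0.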



of the input, e.g., to compute 3x = 2x + x. However, clearing this copy (using the result of multiplication)
essentially requires a division operation. In the context of modular multiplication Cx%M
with gcd(C, M) = 1, division can be performed as multiplication by the modular inverse C^{-1} % M precomputed
by the extended Euclidean GCD algorithm [6]. We will now show how this approach was
developed by Bennett to construct reversible modular multiplication circuits that clear their ancillae
[m Section II], [5, Formulae 4.4-4.6]. Assume a reversible circuit computing g(x) = Cx%M using a
copy-register:

U_g : |x⟩|0⟩ ↦ |x⟩|g(x)⟩   or   U_g : |x⟩|0⟩ ↦ |g(x)⟩|x⟩        (1)

When gcd(C, M) = 1, the function g(x) is reversible, and the same construction can be applied to
g^{-1}(z) = C^{-1} z % M, where C^{-1} is the modular inverse of C:

U_{g^{-1}} : |z⟩|0⟩ ↦ |z⟩|g^{-1}(z)⟩        (2)

When z = g(x), we get

U_{g^{-1}} : |g(x)⟩|0⟩ ↦ |g(x)⟩|x⟩        (3)

A reversible circuit can be reversed by reversing the order of the gates and replacing each gate with
its inverse, keeping in mind that the gates NOT, CNOT and Toffoli are self-inverse:

U_{g^{-1}}^{-1} : |g(x)⟩|x⟩ ↦ |g(x)⟩|0⟩        (4)

Applying U_{g^{-1}}^{-1} after U_g (taken in its second form, with g(x) in the first register) replaces x
with g(x) and leaves the copy-register initialized to 0:

U_{g^{-1}}^{-1} · U_g : |x⟩|0⟩ ↦ |g(x)⟩|0⟩        (5)

Unfortunately, when C has sparse binary or CSD expansion, it is unlikely, in general, that so will
C^{-1} % M. Thus, for constants like 2, 8, 17 and 63, not only do we have to implement two multiplications
rather than one, but the second one may require a much larger circuit, and the number of ancillae can
be significant.
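
The compute/uncompute pattern of Formulae 1-5 can be modeled classically on a pair of registers. The Python sketch below is an illustration under our own naming; `mul_into` is a hypothetical reversible primitive (z, t) ↦ (z, (t + c·z) % M), and `pow(C, -1, M)` (available in Python 3.8 and later) supplies the modular inverse.

```python
def mul_into(c, M, z, t):
    """Reversible map (z, t) -> (z, (t + c*z) % M); with t = 0 it models U_g."""
    return z, (t + c * z) % M

def mul_into_inverse(c, M, z, t):
    """Inverse of mul_into: (z, t) -> (z, (t - c*z) % M)."""
    return z, (t - c * z) % M

def modmul_clearing_ancilla(C, M, x):
    """Map (x, 0) to (C*x % M, 0) using one multiplication by C and one by C^-1."""
    c_inv = pow(C, -1, M)                     # modular inverse, needs gcd(C, M) = 1
    x, gx = mul_into(C, M, x, 0)              # |x>|0>      -> |x>|g(x)>
    first, second = gx, x                     # swap        -> |g(x)>|x>
    first, second = mul_into_inverse(c_inv, M, first, second)  # -> |g(x)>|0>
    return first, second
```

As an instance of the sparsity problem, for M = 35 the inverse of C = 2 is 18 and the inverse of C = 8 is 22, neither of which is a power of two.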



1.3 Modular exponentiation circuits 

A number of mod-exp circuits have been proposed in the literature. The traditional approach is to im- 
plement each controlled modular multiplication separately and chain these operations. Circuits used in 
laboratory experiments with several qubits typically use the following shortcut. Since modular
multiplications in mod-exp start with the value 1, the number of possible outcomes after the first k multiplications
is at most 2^k. Therefore, for k = 1, 2, one can conditionally produce these outcomes without performing
multiplication. This observation is also useful when many qubits are available, but one seeks to decrease 
the depth of the circuit rather than gate counts. In this case, one can establish one register for each 
conditional multiplication Cx%M and use CNOT gates in each register to conditionally replace the 
initial value 1 with C. All these operations are done in parallel and followed by a tree of multipliers. At 
the cost of a several-fold increase in gate counts and an asymptotic increase in bitlines (from linear to 
quadratic), circuit depth reduces from linear to logarithmic. As long as bitlines are the most valuable 
and limited resource of quantum computers, this parallel approach remains impractical. 

1.4 Paper outline

Basic circuit blocks for addition, comparison and modular reduction are introduced in Section 2. Based on
these blocks, we develop multiplicative blocks in Section 3, such as inversion, division with remainder, and
multiplication by constants. In several important cases, we develop linear-sized modular multiplication
circuits which were not known before. Whereas traditional circuit-synthesis algorithms [17] operate at
the bit level, we introduce word-level algorithms that perform dramatically better. Section 4 proposes a
new approach for building Cx%M circuits based on modular decomposition of x that can be implemented
by compact circuits in some cases. Section 5 defines several circuit operators for producing additional
circuits. Examples are given in Section 6. Section 7 proposes circuits for modular exponentiation, based
on techniques from earlier sections. Section 8 shows examples.

2 Additive circuit blocks 

Key arithmetic blocks used by modular multiplication are adders, subtractors and comparators, along 
with their controlled variants. Such circuit blocks are well-known for conventional digital logic, but must 
be adapted to the reversible context so as to avoid explicit fanout and minimize the number of ancillae. 
We introduce such reversible blocks below and illustrate several possible circuit optimizations. One such 
optimization deals with the insertion of control (enable) signals. 

Addition and subtraction. A number of adder circuits developed in the literature can be used in
our constructions. Takahashi [12] describes several other adders with different trade-offs
between circuit size, circuit depth and the required number of ancillae. To be specific, we are using
linear-sized adders by Cuccaro et al. [9], illustrated in Figure 4(b), which are the smallest known. They
are built using MAJ and UMA blocks shown in Figure 4(a). An n-bit Cuccaro adder requires 2n Toffoli
gates and 4n + 1 CNOT gates. Subtraction can be evaluated using bitwise negation as (x - y) = (x' + y)'
or as (x - y)%M = (x + (M - y))%M. The latter formula becomes competitive when the subtrahend y is
known and M - y contains more bits than does y.
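
To make the gate counts concrete, the following Python sketch simulates a ripple-carry adder assembled from MAJ and UMA blocks on classical bits and tallies the gates it applies. The wire layout and function names are our own, and the MAJ/UMA gate sequences are written as we understand them from the Cuccaro construction, so the sketch should be read as an illustration rather than a transcription of Figure 4.

```python
def cuccaro_add(a, b, n):
    """Simulate an n-bit MAJ/UMA adder; returns (a_out, sum_bits, carry_out, counts)."""
    # Wire layout (our choice): w[0] = carry-in ancilla, w[1 + 2i] = b_i,
    # w[2 + 2i] = a_i, w[2n + 1] = carry-out line.
    w = [0] * (2 * n + 2)
    for i in range(n):
        w[1 + 2 * i] = (b >> i) & 1
        w[2 + 2 * i] = (a >> i) & 1
    counts = {"toffoli": 0, "cnot": 0}

    def cnot(c, t):
        w[t] ^= w[c]
        counts["cnot"] += 1

    def toffoli(c1, c2, t):
        w[t] ^= w[c1] & w[c2]
        counts["toffoli"] += 1

    def maj(c, bi, ai):          # leaves the majority (next carry) on wire ai
        cnot(ai, bi)
        cnot(ai, c)
        toffoli(c, bi, ai)

    def uma(c, bi, ai):          # undoes MAJ and leaves the sum bit on wire bi
        toffoli(c, bi, ai)
        cnot(ai, c)
        cnot(c, bi)

    for i in range(n):           # the carry for bit i lives on wire 2i
        maj(2 * i, 1 + 2 * i, 2 + 2 * i)
    cnot(2 * n, 2 * n + 1)       # copy the final carry out
    for i in reversed(range(n)):
        uma(2 * i, 1 + 2 * i, 2 + 2 * i)

    a_out = sum(w[2 + 2 * i] << i for i in range(n))
    s = sum(w[1 + 2 * i] << i for i in range(n))
    return a_out, s, w[2 * n + 1], counts
```

Running the sketch with n = 4 reproduces the counts quoted above: 2n = 8 Toffoli gates and 4n + 1 = 17 CNOT gates, with the a input restored at the end.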

Controlled addition. The structure of Cuccaro adders facilitates controlled addition with a smaller 
overhead. The straightforward solution is to enable such an adder by adding a control to every gate, 
requiring Toffoli gates with three controls that need to be broken down into smaller gates. A more 
economical solution is to disable a Cuccaro adder by (1) disabling the middle CNOT gate by adding a 
control, (2) ensuring that the matching MAJ and UMA gates cancel out. A close inspection of MAJ and 
UMA gates suggests that their Toffoli gates and their middle CNOT gates cancel out. The outer CNOT 
gates can be disabled by adding controls, turning them into Toffoli gates, as illustrated by CMAJ and 
CUMA blocks in Figure 4. Thus, an n-bit controlled addition is possible with 4n + 1 Toffoli gates and
2n CNOTs. 

Controlled addition of a constant (not modular). A known n-bit value with n_1 non-zero bits can
be set on zero-initialized ancillae using n_1 inverters. The adder may modify those values temporarily,
but restores them at the end, which allows one to restore the ancillae lines to zeros for use in subsequent
circuit blocks. Some of these inverters cancel out in the final circuit. When a control input of a gate
is known to be 0 or 1, the gate can be simplified or removed entirely, as shown in Figure 4. Such
optimizations can be performed by a straightforward circuit traversal. Not counting some of the above
simplifications, such a circuit requires no more than 3n - 5 Toffoli gates, n_1 + 2 CNOT gates, and 2n_1
inverters. Given a constant C_0, one can compute M - C_0 and compare possible circuits for adding C_0
and subtracting M - C_0.

Comparators a < b are similar to subtractors: one subtracts a - b and checks a - b < 0. Cuccaro
adders can be modified to perform comparison, leaving their data inputs unchanged and producing a
one-bit result as the most significant carry-bit of subtraction. Therefore, after the MAJ blocks, one
uses inverse MAJ blocks instead of UMA blocks used in adders. When comparing to a known constant,
simplifications are possible as in Figure 4. Comparing to a known n-bit constant with n_1 non-zero bits,
such circuits require no more than 2n - 2 Toffoli gates, 3 CNOT gates, and 4n_1 inverters.

Modular reduction x%M for x < 2M can be performed with one comparator and one conditional 

(a) MAJ and UMA blocks used in addition, subtraction and comparator circuits.
(b) A 4-bit Cuccaro adder [9] based on MAJ and UMA blocks.
Figure 4: Building blocks for addition, subtraction, and comparator circuits; b_i (0 ≤ i ≤ n - 1) is the
i-th bit. In (c), C_out is discarded. In (c) and (d), C_in = 0 for the first MAJ and CMAJ block. The order
of lines in (c) and (d) is the same.



subtraction, connected serially with at most 5n - 7 Toffoli gates, n_1 + 5 CNOT gates, and 8n_1 inverters
(2n_1 inverters to set and reset ancillae before and after the computation). Figure 5(a) shows such a circuit
and exhibits additional gate optimizations at the interface between the comparator and the subtractor.
The inverter on x_3 is the result of simplifying a CNOT gate in a Cuccaro adder. Figure 5(b) shows further
optimizations using Toffoli gates with negative controls.⁴ Figure 6 illustrates controlled modular
reduction, where the comparator and the subtractor remain intact, but the result of comparison is
conditioned on the new control using a new ancilla. This ancilla is cleared at the end, but the garbage
output γ inherited from uncontrolled modular reduction remains. Modular reduction for x ≫ M is
discussed in Section 3 under division with remainder.

Conditional modular addition of a constant. The straightforward implementation y = (x + a)%M
by adding a constant and then performing mod-M reduction clears the added carry bit, but leaves a
garbage bit. This bit can be cleared by (y < a). To avoid the carry, precompute a_M = M - a, use
y = (x < a_M) ? x + a : x - a_M,⁵ and clear the ancilla via (y < a). Comparators optimized for x < a_M
may be smaller than those for x < M.
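
The carry-free variant can be modeled classically as follows. This Python sketch uses our own names and one particular polarity for the ancilla (it holds 1 when the sum wraps around M); with that choice the final comparison (y < a) resets the ancilla directly.

```python
def mod_add_const(x, a, M):
    """Return ((x + a) % M, ancilla) for 0 <= x < M and 0 <= a < M."""
    a_M = M - a                       # precomputed classically
    wrapped = int(x >= a_M)           # comparator result stored on an ancilla
    y = x - a_M if wrapped else x + a # conditional add/subtract, never exceeds M - 1
    wrapped ^= int(y < a)             # second comparator clears the ancilla
    return y, wrapped
```

The ancilla is cleared because y < a holds exactly when the addition wrapped: without wrapping y = x + a ≥ a, and with wrapping y = x - a_M < M - a_M = a.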

⁴In practice, CNOTs and Toffoli gates with negative controls may be as easy to implement as the gates with positive
controls. Otherwise, additional inverters around the controls suffice. Negative controls not only result in more compact
circuit diagrams, but can also help reading such circuits. Recall that positively-controlled Toffoli gates with targets on
0-initialized ancillae compute the AND function ab ⊕ 0 = ab. Using negative controls and a 1-initialized ancilla computes the
OR function: a'b' ⊕ 1 = (a + b) ⊕ 0 = a + b.

⁵The ternary conditional a ? b : c is similar to if (a) then b else c, but is more flexible. It returns the value of b or
the value of c (it can be an l-value).

Figure 5: (a) A circuit for modular reduction with M = 21 (after optimization) consisting of a comparator
and subtractor based on the Cuccaro construction [9]. For x ≥ 21, γ = 1, which activates the subtractor
block. Using this circuit as a building block may require clearing or avoiding the garbage output bit γ
as in Figure 6. (b) Further optimization using negative controls, shown with hollow circles.

Figure 6: Modular reduction with M = 21 controlled by an additional control line. The last Toffoli gate clears the added
ancilla. Compared to Figure 5, added are one ancilla and two Toffoli gates incident to the control line, of which the
first enables subtraction and the second clears the ancilla.



3 Multiplicative circuit blocks 

We now develop several circuits for Cx%M and related operations, using additive building blocks from
Section 2.

Circuits for (2^k + 1)x (not modular) can be constructed by shifts and adds, but the challenge is to
avoid unnecessary garbage ancillae. Our circuits are structured as follows. For bit values x_i (i < n)
of x, the bit values of (2^k + 1)x, s_i, can be constructed by a k-bit shift of x followed by an (n + k)-bit
add (i.e., 2^k x + x). The addition can be performed by a generic Cuccaro adder (2^k x on main
qubits, x on ancillae), but clearing these ancillae is difficult. Another approach is to construct logical
sub-expressions for output bit i based on the bit values of x_0 ··· x_i. Formula 6 gives sub-expressions for
each s_i bit. To calculate each s_i, we precompute the incoming carry c_i in Formula 6 and store it on an
ancilla. For n such ancillae, we need at most 3n Toffoli gates. To construct s_i values, at most 3n CNOT
gates suffice. To clear the c_i ancillae after use, the Toffoli gates that computed them are performed in
reverse (their inputs did not change). With the additional 3n Toffoli gates to clear ancillae, a circuit for
(2^k + 1)x needs up to 6n Toffoli gates and 3n CNOTs. Figure 9(b) illustrates a 4-bit 3x circuit after two
optimizations: (i) literal reduction in s_i and c_i sub-expressions, and (ii) absorbing inverters in Toffoli
gates with negative controls.



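s_i = x_i ⊕ x_{i-k} ⊕ c_i
c_i = x_{i-1} x_{i-k-1} ⊕ x_{i-1} c_{i-1} ⊕ x_{i-k-1} c_{i-1}        (6)

(with c_0 = 0 and x_j = 0 for j < 0 or j ≥ n)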

Circuits for -x%M. For bitwise negation x', recall x' = 2^n - 1 - x. Therefore, (x + M')' = 2^n - 1 -
(x + M') = 2^n - 1 - x - (2^n - 1 - M) = M - x = -x%M. Therefore, for 0 < x < M, -x%M can be computed
as (x + M')' using one Cuccaro adder, as illustrated in Figure 7 for M = 21 and M' = 31 - 21 = 10. The
proposed circuit maps x = 0, x = M, and x > M into M, 0, and 2^n + (M - x), respectively. Note that
inverting any circuit for -x%M will produce a circuit computing -x%M because (M - (M - x)) = x.
A circuit for conditional -x%M can be constructed by converting each inverter to a CNOT gate and
applying the conditional modular reduction discussed in Section 2 and illustrated in Figure 6.

Figure 7: Circuits for -x%21 based on a 5-bit Cuccaro adder: (a) Inverters on the zero-initialized ancillae
prepare the value 10 = 31 - 21 = 21', (b) an optimized circuit, (c) further optimization using negative controls.
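
The identity (x + M')' = M - x is easy to confirm on n-bit words. The Python sketch below (the function name is ours) masks every intermediate value to n bits, mimicking an adder whose carry-out is dropped.

```python
def neg_mod(x, M, n):
    """Compute (x + M')' on n-bit words, where ' denotes bitwise negation."""
    mask = (1 << n) - 1
    M_comp = mask ^ M                     # M' = 2^n - 1 - M
    return mask ^ ((x + M_comp) & mask)   # n-bit addition, then negation
```

For n = 5 and M = 21 this returns 21 - x whenever 0 < x < 21, and it also reproduces the boundary behaviour stated above: 0 maps to 21, 21 maps to 0, and 22 maps to 32 + (21 - 22) = 31.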

Circuits for 2^k x%M for odd M > 2. We start with a linear-sized circuit for 2x%M that clears its
ancillae.⁶ The bulk of our 2x%M circuit computes x % ⌈M/2⌉ using a modular-reduction circuit we
described earlier, which evaluates x ≥ ⌈M/2⌉ on a 0-initialized ancilla, but also zeros out the most
significant bit. To multiply x % ⌈M/2⌉ by two, it suffices to rotate the bits, which moves the most
significant zero into the least significant position. One also needs to (a) change the LSB to 1 conditional
on the ancilla, which can be done with a CNOT, and (b) clear the ancilla conditional on the LSB, which
can be done with another CNOT. This circuit is illustrated in Figure 8(a). Further circuit optimization
uses three tricks. One is the merger of inverters into negative controls (shown with hollow circles) of
Toffoli gates; this may benefit from creating pairs of canceling inverters and/or moving inverters through
targets of CNOT/Toffoli gates. The second optimization deals with the two CNOTs at the end of the
circuit. It creates a pair of canceling CNOTs prior to them, so that three CNOTs can be combined into
a SWAP. The remaining CNOT gate is controlled by a 0 value created by doubling, thus can be removed.
We are left with a chain of SWAP gates that rotate the significant bits onto an ancilla; in particular,
the 0 value is rotated onto the ancilla (at which point all ancillae are cleared). The third optimization
interprets the bit rotation at the end of the circuit as a relabeling of outputs. The resulting circuit in
Figure 8(b) works correctly only for x < M, but x < M can be guaranteed in Shor's algorithm.
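
The module-based 2x%M construction can be traced step by step in a classical model. The Python sketch below uses our own function name; each line corresponds to one stage described above (comparison with ⌈M/2⌉, conditional subtraction, doubling by a shift, and the two CNOTs).

```python
def double_mod(x, M):
    """Return (2*x % M, ancilla) for odd M and 0 <= x < M."""
    half = (M + 1) // 2            # ceil(M/2) for odd M
    ancilla = int(x >= half)       # comparator result on a 0-initialized ancilla
    r = x - half if ancilla else x # x % ceil(M/2); its top bit is now 0
    y = r << 1                     # doubling: the zero MSB becomes the LSB
    y |= ancilla                   # first CNOT: set the LSB when the ancilla is 1
    ancilla ^= y & 1               # second CNOT: clear the ancilla using the LSB
    return y, ancilla
```

For M = 21 and x = 15 the stages give half = 11, r = 4 and y = 9, which equals 30 % 21, and the ancilla ends at 0.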

Since output relabeling cannot be used in a controlled 2x%M circuit, controlled rotation can be
implemented with controlled-SWAP gates. However, when multiple 2x%M circuits are concatenated to
implement 2^k x%M, all controlled rotations can be merged into one such rotation at the end of the circuit.

⁶Circuits in the literature may exhibit quadratic size because, to clear ancillae, they implement x/2 %M by finding the
modular inverse of 2 (see Section 1.2) and decomposing it in binary.

Figure 8: Circuits for f(x) = 2x%21, (a) module-based design, (b) after three optimizations. The first
and second sub-circuits in these figures are for (x > 10) and (x > 10 ? x - 11 : x), respectively. In (a),
SWAP gates are used to compute 2x, the second-to-last CNOT adds 1 (i.e., 2x + 1), and the last CNOT
clears the bottom ancilla. These gates can be removed by reordering output labels as shown in (b).



Modular reduction 2^j x%M for M = 2^k ± 1 can be performed using a well-known algorithm. To
compute x%(2^k - 1), add 2^k-ary digits of n-bit x modulo 2^k - 1. To compute x%(2^k + 1), alternate
addition and subtraction of 2^k-ary digits of n-bit x modulo 2^k + 1. When k > n, no gates are needed.
Otherwise, one can use ⌈n/k⌉ Cuccaro adders on log_2 M bits. In the case M = 2^k - 1, the output carry
of each adder can be ignored. Hence, Toffoli and CNOT gate counts are ⌈n/k⌉ · 2k and ⌈n/k⌉(4k + 1),
respectively. For M = 2^k + 1, the output carry of each Cuccaro adder should be considered. In this
case, at most ⌈n/2k⌉ mod-M reduction modules on log_2 M bits are sufficient. Therefore, the numbers of
Toffoli and CNOT gates are ⌈n/k⌉(2k + 2) + ⌈n/2k⌉(5k - 2) and ⌈n/k⌉(4k + 5) + ⌈n/2k⌉(n_1 + 5) (n_1 < k + 1
represents the number of non-zero bits in M), respectively. Another approach to implement the required
additions and subtractions is to implement the counters (0, 1, ..., 2^k ± 1 - 1) and (2^k ± 1 - 1, 2^k ± 1 - 2, ..., 0)
conditional on bit values of n-bit x as illustrated in Figure 10 for x%3. Clearly, no %M modules will
be required in this approach. A factor of 2^j only changes the indices of the bits read by the baseline
algorithm. All circuits constructed here exhibit a linear number of CNOT and Toffoli gates in terms of n.
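
The digit-summing rule rests on 2^k = 1 (mod 2^k - 1) and 2^k = -1 (mod 2^k + 1). A minimal Python sketch follows; the function names and the `sign` parameter (+1 or -1, selecting M = 2^k + 1 or M = 2^k - 1) are our own conventions.

```python
def digits_base_2k(x, k):
    """2^k-ary digits of x, least significant first."""
    mask = (1 << k) - 1
    digits = []
    while x:
        digits.append(x & mask)
        x >>= k
    return digits

def mod_pow2_pm1(x, k, sign):
    """Compute x % (2^k + sign) for sign in (+1, -1) by (alternating) digit sums."""
    M = (1 << k) + sign
    acc = 0
    for j, digit in enumerate(digits_base_2k(x, k)):
        # For M = 2^k + 1, odd-position digits enter with a minus sign.
        term = -digit if (sign == 1 and j % 2 == 1) else digit
        acc = (acc + term) % M
    return acc
```

With k = 1 and sign = +1 this is exactly the alternating bit sum used for x%3.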

Special case 2^k x%M where M = 2^n - 1 ± d and both k and d are very small. Breaking down x into k more
significant bits and n - k less significant bits, we write x = 2^{n-k} x^hi_k + x^lo_{n-k} = 2^{n-k} ⌊x/2^{n-k}⌋ + x % 2^{n-k}
and then, using 2^n % M = (1 ∓ d) % M,

Cx%M = (2^n x^hi_k + 2^k x^lo_{n-k}) % M
     = (x^hi_k (1 ∓ d) + 2^k x^lo_{n-k}) % M = ((2^k x^lo_{n-k} + x^hi_k) ∓ d x^hi_k) % M = (rot_k(x) ∓ d x^hi_k) % M        (7)

where rot_k(x) is a cyclic shift (rotation) of x by k bits (rot_k(x) may exceed M). Note that when d = 0
or d = 2, we get a well-known special case described above. When |d| < 2^{n-k}, modular reduction can be
computed by subtracting M if the number exceeds M, which allows one to compute the product d x^hi_k
through a series of shifts, additions and subtractions. For larger values of d, we can write 2^k = 2^{k_1 + k_2}
such that |d| < 2^{n-k_1} and |d| < 2^{n-k_2}, then compute 2^k x by multiplying by 2^{k_1} and by 2^{k_2} in separate
steps. Another approach would let the i-th bit of x^hi_k control the modular addition of a precomputed
constant d 2^i % M, as shown in Section 2.
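
The rotation identity can be checked numerically. In the Python sketch below we assume, as our own convention, a signed offset d with M = 2^n - 1 + d, so that 2^n = 1 - d (mod M) and the correction term enters with a minus sign; the function name is hypothetical.

```python
def mul_pow2_special(x, n, k, d):
    """Compute 2^k * x % M for M = 2^n - 1 + d via a rotation and a small correction."""
    M = (1 << n) - 1 + d
    x_hi = x >> (n - k)                  # the k more significant bits
    x_lo = x & ((1 << (n - k)) - 1)      # the n - k less significant bits
    rot = (x_lo << k) + x_hi             # rot_k(x): cyclic shift by k bits
    return (rot - d * x_hi) % M          # correction term d * x_hi
```

The check relies on 2^k x = rot_k(x) + (2^n - 1) x_hi, where 2^n - 1 = M - d.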

Division with remainder circuits convert x into (⌊x/p⌋, x%p) without loss or gain of information. A
simple example is given by p = 2^k, where the quotient and the remainder are simply the n - k high and
the k low bits of x. Previously, we have also shown linear-sized remainder circuits for 2^k ± 1. In general,
division can be performed by a series of subtractive modular reductions, whose ancillae accumulate the
bits of the quotient. When x < 2^k p, the most significant bit is computed by a mod-2^{k-1} p reduction, the
next bit by a mod-2^{k-2} p reduction, etc., for a total of k + 1 reductions. The last reduction produces the
remainder.
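
The series of subtractive reductions can be sketched classically as follows. The function name is ours, and the loop simply starts at the mod-2^{k-1} p reduction named in the text and proceeds down to mod-p; each comparison result is one quotient bit that would sit on an ancilla.

```python
def divmod_by_reductions(x, p, k):
    """Return (x // p, x % p) for 0 <= x < 2^k * p using subtractive reductions."""
    quotient = 0
    for i in reversed(range(k)):
        modulus = p << i               # 2^i * p
        bit = int(x >= modulus)        # comparator: one bit of the quotient
        if bit:
            x -= modulus               # conditional subtraction
        quotient |= bit << i
    return quotient, x                 # x now holds the remainder
```

Because the pair (quotient, remainder) determines x uniquely, the map loses no information, which is what makes it suitable for a reversible implementation.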

4 Cx%M using division with remainder 

We propose the following computation of Cx%M. 

Theorem 4.1 Consider integers < x < M, 1 < C < M with gcd(C, Af) = 1. Define p = \^~\ and 
S^C~ M%C. Then 

Cx%M = (S lx/p\ + C{x%p)j %M (8) 

Furthermore, when < M , 

5[x/p\+C{x7op) <2M (9) 
so that a single subtractive mod-M reduction suffices. 

Proof. Clearly, x = p⌊x/p⌋ + x%p. Then Cx = Cp⌊x/p⌋ + C(x%p). We leave the latter term as is because C(x%p) ≤ C(p − 1) = C⌊M/C⌋ < M. To reduce the former term, note that M < Cp < 2M, thus Cp%M = Cp − M. Substitute M = C⌊M/C⌋ + M%C = C⌈M/C⌉ − C + M%C to obtain Cp%M = C − M%C = δ, proving Formula 8.

Since δ < C and ⌊x/p⌋ ≤ Cx/M < C, we have δ⌊x/p⌋ < C² < M, which proves Formula 9. ∎
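Formulas 8 and 9 are easy to check by brute force for small moduli. The Python sketch below is an illustration only; the function name cx_mod_m_by_division is an assumption of this sketch.

```python
def cx_mod_m_by_division(C, x, M):
    """Return (Cx % M, s) where s = delta*floor(x/p) + C*(x % p) as in Formula 8."""
    p = -(-M // C)            # ceil(M / C)
    delta = C - M % C
    quotient, remainder = divmod(x, p)
    s = delta * quotient + C * remainder
    return s % M, s


# 3x % 35 with x = 30: p = 12, delta = 1, so s = 1*2 + 3*6 = 20 = 90 % 35.
print(cx_mod_m_by_division(3, 30, 35))  # (20, 20)
```

Here s stays below 2M whenever C² < M, so at most one subtraction of M is needed after the addition.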
To construct reversible circuits using this result, use the circuits for division with remainder from Section 3 to represent x by the pair (⌊x/p⌋, x%p) without a loss or gain of information. This will require ⌈log2 C⌉ subtractive mod-2^i p reductions, with the ⌈log2 C⌉-bit quotient stored in ancillae (for C = 3, two mod-p reductions are performed). A challenging part is to implement multiplications by constants δ⌊x/p⌋ and C(x%p) with reversible circuits, so that ancillae are cleared. This is illustrated in Section 3 for C = 2^n ± 1. After modular addition, ancillae can be cleared by computing (Cx%M)%C, which takes linear time for C = 2^n ± 1 as explained in Section 2.

Figure 9: Circuits for (a) modular reduction with M = 12 (after optimization), and (b) computing f(x) = 3x for x < 16. In (b), gates in the dashed box clear ancillae.

Example 4.1 This approach is illustrated in Figure 10 for 3x%35 where p = 12, δ = 1. We implement ⌈log2 3⌉ = 2 subtractive mod-12 and mod-24 reductions by two successive %12 modules. Accordingly, the second-to-last and the last ancillae evaluate to 1 when 12 ≤ x and 24 ≤ x, respectively. After the first CNOT, the ancillae will be 1 for 12 ≤ x < 24 and 24 ≤ x. Therefore, the values of the lowest-placed ancillae are 0, 1 or 2 based on the value of x. Computing 3(x%12) (not modular) and adding 1 for 12 ≤ x < 24 or 2 for 24 ≤ x (Formula 8) implement 3x%35. To clear the ancillae used, one needs to implement %3 on two new ancillae (two highest-placed ancilla bits in Figure 10) and use the bits to control two CNOT gates. Since 2 ≡ −1 (mod 3), we can rewrite x = 2^5 x_5 + 2^4 x_4 + ⋯ + 2x_1 + x_0 %3 as −x_5 + x_4 − x_3 + x_2 − x_1 + x_0 %3, which can be implemented by three up-counters conditioned on even bits and three down-counters conditioned on odd bits. The %3 computation can be undone by applying %3 in reverse (indicated by the (%3)^{−1} block in Figure 10).
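The alternating-sum evaluation of x%3 can be mimicked classically. In the Python sketch below, the name mod3_by_counters and the default width of 6 bits are assumptions of this illustration; the variable counter stands in for the two ancilla bits that cycle through 0, 1, 2.

```python
def mod3_by_counters(x, nbits=6):
    """Compute x % 3 with an up/down counter driven by the bits of x."""
    counter = 0
    for i in range(nbits):
        if (x >> i) & 1:
            if i % 2 == 0:
                counter = (counter + 1) % 3  # count up on even bit positions
            else:
                counter = (counter - 1) % 3  # count down on odd bit positions
    return counter


assert all(mod3_by_counters(x) == x % 3 for x in range(64))
```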

Values C = O(1) facilitate linear-time mod-p decomposition by subtractive reductions and also imply δ = O(1), ⌊x/p⌋ = O(1). Therefore the first multiplication can be performed through controlled additions of constants. Given our circuits for (not modular) multiplication by (2^k + 1) in Section 3, Formula 8 can be used with C = 5, 9, 17, 33, .... Circuits for multiplication by δ = C − M%C are available for δ of the form 2^k and 2^k + 1.

Working with C > √M directly can be difficult because many modular reductions may be required in Formula 8 and their ancillae must be cleared. It helps to postpone, until the end, clearing the ancillae that contain x%p, and use them to clear the ancillae for modular reductions. Another trick is to avoid unnecessary modular reductions by interpreting each multiplication %M. In particular, large C values can be replaced by M − C if the addition is replaced by subtraction. In this context, some large C values may also be convenient when δ = O(1) and p = 2^{O(1)}, and thus the second multiplication can be performed through controlled additions of constants.

To count the number of Toffoli and CNOT gates for x ↦ (⌊x/p⌋, x%p), we use ⌈log2 C⌉ subtractive mod-2^i p reductions and ⌈log2 C⌉ ancillae. The reductions go from larger numbers to smaller numbers, ending with p. A 2^i p-reduction module operates on ⌈log2 p⌉ + i bits. Hence, the number of Toffoli gates will be Σ_{i=0}^{⌈log2(M/p)⌉−1} (5⌈log2 p⌉ − 7 + 5i) < ⌈log2 C⌉(5⌈log2 M⌉ − 7). To compute (2^k + 1)x%M by division with remainder, one additionally uses a multiplicative module Cx (not modular) to compute C(x%p). Consequently, one Cuccaro adder and one %M module are employed to add ⌊x/p⌋ to the result and apply the mod-M reduction. To clear ancillae by computing (Cx%M)%C, two %C modules and O(1) gates are necessary. Additionally, ⌈log2 C⌉ and ⌈log2 M⌉ + 1 ancillae are required for the first modular reductions and other blocks, respectively.

5 Circuit operators and decompositions 

Given small circuits proposed earlier, we find additional C values for which small Cx%M circuits exist. 

Figure 10: A circuit for 3x%35 in Example 4.1. Circuits for 3x and modular reduction with M = 12 = ⌈35/3⌉ are illustrated in Figure 9. The two mod-12 reductions compute (⌊x/12⌋, x%12), the latter is unconditionally multiplied by 3 (the 3x box) and the result is added to the former (the adder box). The %3 module modifies two highest-placed ancilla bits, counting up (0,1,2) and down (0,2,1). It consists of three count-up (↑) and three count-down (↓) blocks; the first block of each kind is shown in detail. The two CNOTs that target the last two lines clear ancillae set by %12 modules. The (%3)^{−1} module clears ancillae set by the %3 module.



Table 1: Number of gates in circuit blocks. In this table, n_1 and n'_1 represent the number of non-zero bits in M and C, n = ⌈log2 M⌉, and n' = ⌈log2 C⌉. T, C, and A are the numbers of Toffoli gates, CNOT gates and ancillae, respectively. A more accurate gate count for division with remainder is discussed in the text. Further constant-specific optimizations may be possible.

5.1 Multiplicative decompositions

We employ two circuit operators, inversion and negation, that convert a reversible circuit for Cx%M into a circuit for τ(C)x%M, where the function τ(·) characterizes the transform.

Inversion reverses the order of the gates and replaces each gate with its inverse (inverters, CNOT gates, swaps and Toffoli gates are self-inverse). Circuit size is preserved. The generated circuit computes τ(C)x%M, where τ(C) is the mod-M inverse of C, i.e., τ(C)C%M = 1. When gcd(C, M) = 1, the modular inverse exists, is unique and can be computed by the extended Euclidean algorithm [5]. When applied to small-power-of-two circuits (C = 2^k), inversion produces negative-power-of-two circuits (C = 2^{−k}%M) and generates new convenient C values unless M = 2^n − 1.

Negation entails τ(C) = M − C = −C%M. Note that −x%M = M − x = (2^n − 1) ± d − x = x' ± d, where ' performs bitwise negation. Therefore, the circuit operator adds an inverter on every wire and performs one modular addition/subtraction with d, either before or after modular multiplication by C. Circuit size increases. When M is odd, so are M − 2^k, producing new convenient values. Combining negation with inversion may produce additional convenient values. Given that the two transforms commute, applying inversion and negation to small powers of two produces at most 4⌈log2 M⌉ convenient values (including small powers of two), which can be a lot smaller than φ(M).
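To see how many constants inversion and negation reach from small powers of two, one can enumerate them directly. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, and "small" is taken to mean exponents 0 ≤ k < ⌈log2 M⌉ (approximated by M.bit_length()), which is a choice made here rather than a definition from the text. Modular inverses use the three-argument pow with a negative exponent (Python 3.8 or later).

```python
from math import gcd


def power_of_two_family(M, max_k=None):
    """Map each C of the form ±2^(±k) % M, 0 <= k < max_k, to a label with minimal k."""
    assert M % 2 == 1
    if max_k is None:
        max_k = M.bit_length()      # range of "small" exponents: an assumption of this sketch
    family = {}
    for k in range(max_k):
        pos = pow(2, k, M)
        neg = pow(2, -k, M)         # inverse power of two modulo M
        for value, label in ((pos, f"2^{k}"), (neg, f"2^-{k}"),
                             (M - pos, f"-2^{k}"), (M - neg, f"-2^-{k}")):
            family.setdefault(value, label)
    return family


def euler_phi(M):
    """Number of C in [1, M) coprime with M."""
    return sum(1 for c in range(1, M) if gcd(c, M) == 1)


print(len(power_of_two_family(21)), euler_phi(21))  # 12 12
print(len(power_of_two_family(33)), euler_phi(33))  # 10 20
```

For M = 21 every coprime C is reached, while for M = 33 half of them are not, so other primitive circuits are needed there.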






Modular products. Composing compact circuits for convenient constants in series, one can often obtain additional convenient constants C = C_1 C_2 %M. However, when multiplying small positive and negative powers of two, no new values can be obtained. Multiplying positive powers of two (or negative powers of two) does not help when M = 2^n − 1, e.g., for M = 15. Products with negated powers of two do not give new convenient values when M = 2^n + 1, e.g., for M = 33. In general, since (Z/MZ)^× is a product of two cyclic groups, it suffices to build compact reversible circuits for its two generators and compose them in various ways to produce reversible circuits for all other group elements. This strategy is impractical because (i) the composed circuits will often be larger than necessary, (ii) it is not clear how to identify a pair of generators without knowing p and q.

5.2 Additive decompositions and a shortest-path formalism

For large C, the multiplicative operators described above may be insufficient. To also consider additive operators, we introduce a zero-initialized ancilla register which is cleared after Cx%M is computed in the primary register. A value is copied into this register from the primary register using a parallel chain of CNOT gates. Multiplicative operators can be applied to individual registers, and additive operators replace the contents of one of the registers with the modular sum or difference of two values (note that these operations are reversible). The operators we consider are listed in Table 2 along with their costs, measured as the number of T gates (which dominate quantum cost). We use the following circuit descriptions.

• Every step/operator takes exactly two characters 

• Odd-numbered characters are operator types: c,~,+,-,d,h,r,t,v,f 

• Even-numbered characters are register indices: 1 or 2.

For example, the literal c2 represents a bit-wise CNOT operation with Register 2 as its target. It is meant to copy the contents of Register 1 to a zero-initialized Register 2 (or clear Register 2, when it duplicates Register 1). The same can be accomplished using the modular addition operator +2 (the modular subtraction operator -2, respectively), but at a higher cost. The circuits c2c2, +2-2 and r1t1 do nothing, and the circuit c2c1c2 swaps the contents of the two registers. As a more complex example, to compute (x, 0) ↦ (3x%65, 0) without the multiplicative 3x%M operator we introduced earlier, one might use the circuit c2+1+1+2+2d2+2d2d2c2. It uses 154 T gates, and is smaller than our generic 3x%M circuit. However, such compact circuits need to be discovered for each M. We reduce this task to finding a shortest path in a graph where the vertices represent possible two-register states relative to the initial state (x, 0). For 0 ≤ a, b < M, vertex (a, b) represents (ax, bx). The source vertex is (1, 0). The weighted edges represent operators from Table 2 with respective costs. When traversing this graph, vertices and edges can be generated on the fly.
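The shortest-path reduction can be prototyped in a few lines. The Python sketch below is an illustration under explicit assumptions and is not the implementation used for the reported results. It covers only the operators c, +, -, ~, d and h, read here as copy/clear, modular add, modular subtract, negate, double and halve. The costs in COST are illustrative values consistent with the M = 65 examples (for instance, 154 T gates for the circuit above); Table 2 remains the authoritative source for operator costs. All function names are hypothetical.

```python
import heapq

# Illustrative T-gate costs for M = 65 (assumed for this sketch).
COST = {"c": 0, "+": 14, "-": 14, "~": 14, "d": 28, "h": 28}


def neighbors(state, M):
    """Yield (step, cost, next_state) for every operator applicable to state (a, b)."""
    a, b = state
    half = pow(2, -1, M)
    for reg in (1, 2):
        tgt, src = (a, b) if reg == 1 else (b, a)
        moves = {"+": (tgt + src) % M, "-": (tgt - src) % M, "~": (-tgt) % M,
                 "d": 2 * tgt % M, "h": half * tgt % M}
        if tgt == 0:
            moves["c"] = src      # CNOT chain copies into a cleared register
        elif tgt == src:
            moves["c"] = 0        # CNOT chain clears a duplicate
        for op, new in moves.items():
            yield f"{op}{reg}", COST[op], ((new, b) if reg == 1 else (a, new))


def shortest_circuits(M):
    """Dijkstra from (1, 0); edges are generated on the fly."""
    dist, prev, heap = {(1, 0): 0}, {}, [(0, (1, 0))]
    while heap:
        d, s = heapq.heappop(heap)
        if d > dist[s]:
            continue
        for step, w, t in neighbors(s, M):
            if d + w < dist.get(t, float("inf")):
                dist[t], prev[t] = d + w, (s, step)
                heapq.heappush(heap, (d + w, t))
    return dist, prev


def circuit_for(C, prev):
    """Read back the operator string of a cheapest path to vertex (C, 0)."""
    steps, s = [], (C, 0)
    while s != (1, 0):
        s, step = prev[s]
        steps.append(step)
    return "".join(reversed(steps))


def run(circuit, M):
    """Replay a two-characters-per-step circuit starting from (1, 0)."""
    state = (1, 0)
    for i in range(0, len(circuit), 2):
        state = next(t for step, _, t in neighbors(state, M) if step == circuit[i:i + 2])
    return state


dist, prev = shortest_circuits(65)
assert run("c2+1+1+2+2d2+2d2d2c2", 65) == (3, 0)
print(dist[(3, 0)], circuit_for(3, prev))
```

Because this sketch omits the remaining operators (r, t, v, f), the costs it finds are upper bounds on what the full operator set can achieve.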

Theorem 5.1 For an n-bit value M and any 0 < C < M coprime with M, the worst-case gate count of optimal two-register circuits mapping (x, 0) ↦ (Cx%M, x) and (x, 0) ↦ (Cx%M, 0) is O(n²).

Proof. Once the statement for (x, 0) ↦ (Cx%M, x) is proven, the statement for (x, 0) ↦ (Cx%M, 0) follows by Bennett's construction for clearing ancillae (Section 1.2), which produces a circuit of the second kind by composing two circuits of the first kind. Consider the binary decomposition C = Σ_i 2^i b_i and traverse it from the most significant bit. Before considering a new bit, apply the d2 operator, except when the second register holds value 0. Upon seeing bit 1, apply the +2 operator. For example, C = 13 = 0b1101 leads to operators +2d2+2d2d2+2, which produce

(x, 0) ↦ (x, x) ↦ (x, 2x) ↦ (x, 3x) ↦ (x, 6x) ↦ (x, 12x) ↦ (x, 13x)

To swap the register values, one can apply c2c1c2 or c1c2c1, but this may be unnecessary within Bennett's construction. Each operator uses O(n) gates. The circuits use n − 1 d2 operators and up to n +2 operators, thus O(n²) gates total. ∎
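The double-and-add traversal used in the proof can be written out directly. The Python sketch below is illustrative; the function names are assumptions of this sketch, and apply_ops only tracks the multipliers (a, b) of the state (ax, bx).

```python
def double_and_add_ops(C):
    """Operator string that builds (x, Cx) from (x, 0), scanning C from its top bit."""
    ops, started = [], False
    for bit in bin(C)[2:]:
        if started:
            ops.append("d2")    # double Register 2 before each new bit
        if bit == "1":
            ops.append("+2")    # add Register 1 into Register 2
            started = True
    return "".join(ops)


def apply_ops(ops, M):
    """Return the multipliers (a, b) after applying d2 / +2 steps to (1, 0)."""
    a, b = 1, 0
    for i in range(0, len(ops), 2):
        if ops[i:i + 2] == "d2":
            b = 2 * b % M
        else:
            b = (b + a) % M
    return a, b


print(double_and_add_ops(13))  # +2d2+2d2d2+2
```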
Table 2: Circuit operators for two n-bit registers and their costs in terms of the number of T gates. Note that for each operator, its inverse is also listed in the table and has the same cost.

Figure 11: Cx%M circuit costs for select M values, shown as cumulative distribution functions.

The upper bound on the T-cost of (x, 0) ↦ (Cx%M, x) circuits implied by our proof is n(6n − 7) + 2n² = 8n² − 7n, with the average-case estimate n(6n − 7) + n² = 7n² − 7n because half of the bits are 1 on average. These bounds can be improved by considering the canonical signed digit (CSD) decomposition, which uses not only additions but also subtractions, and ensures that at least one of each two neighboring bits is a 0. Thus 7n² + O(n) becomes a worst-case bound, and the average case improves to 6⅔n² + O(n). For (x, 0) ↦ (Cx%M, 0) circuits, doubling the above estimates due to the use of Bennett's construction produces 14n² + O(n) in the worst case and 13⅓n² + O(n) on average. The smallest-cost circuits we report in Section 6.2 improve upon these bounds by factors 2-4, but not asymptotically. We also note that our shortest-path construction produces O(n)-sized circuits in some basic cases, such as C = 2, 3, 4, ... In contrast, resorting to Bennett's construction with binary or CSD expansion involves the modular inverse of C and typically leads to n²-sized circuits.
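The canonical signed digit decomposition mentioned above can be computed with the standard non-adjacent-form recurrence. The Python sketch below is illustrative; the function names are assumptions of this sketch, and evaluate tracks only the multiplier held in Register 2.

```python
def csd_digits(C):
    """Signed digits of C in {-1, 0, 1}, least significant first, no two adjacent nonzero."""
    digits = []
    while C > 0:
        if C & 1:
            d = 2 - (C % 4)     # +1 if C % 4 == 1, -1 if C % 4 == 3
            C -= d
        else:
            d = 0
        digits.append(d)
        C >>= 1
    return digits


def csd_ops(C):
    """Operator string with d2, +2 and -2 steps that builds (x, Cx) from (x, 0)."""
    ops = []
    for d in reversed(csd_digits(C)):
        if ops:
            ops.append("d2")
        if d == 1:
            ops.append("+2")
        elif d == -1:
            ops.append("-2")
    return "".join(ops)


def evaluate(ops, M):
    """Multiplier in Register 2 after applying the steps to (1, 0)."""
    b = 0
    for i in range(0, len(ops), 2):
        step = ops[i:i + 2]
        if step == "d2":
            b = 2 * b % M
        elif step == "+2":
            b = (b + 1) % M
        else:
            b = (b - 1) % M
    return b


print(csd_ops(15))  # +2d2d2d2d2-2
```

For C = 15 the signed form needs two additive steps instead of the four required by the binary expansion.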

6 Examples of modular multiplication 

Here we study M = pq for small prime p and q. One can argue that large classes of such M values should be excluded from consideration in the context of Shor's algorithm because they offer no value for number-factoring. For example, numbers of the form M = b^{2k} − 1 can be factorized quickly by computing gcd(M, b^k ± 1), and this class includes the number 15, commonly used in experimental demonstrations of Shor's algorithm. The same argument applies to some numbers that satisfy mM = b^{2k} − 1, where m has very few factors. This class includes the number 21, considered as the next example after 15 for quantum number-factoring. Indeed, 3·21 = 8² − 1 = 7·9 leads to gcd(7, 21) = 7. Nevertheless, we consider these cases for completeness and use them to illustrate general circuit constructions.^

^This does not justify the use of 15 and 21 in physical experiments where scalability must be demonstrated.
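The classical shortcut for such moduli amounts to two gcd computations. The Python sketch below is illustrative; the function name and the parameters b, k, m (with m·M = b^{2k} − 1) are assumptions of this sketch.

```python
from math import gcd


def split_by_difference_of_squares(M, b, k, m=1):
    """Given m*M == b**(2*k) - 1, return the nontrivial gcds of M with b**k - 1 and b**k + 1."""
    assert m * M == b ** (2 * k) - 1
    candidates = (gcd(M, b ** k - 1), gcd(M, b ** k + 1))
    return sorted({g for g in candidates if 1 < g < M})


print(split_by_difference_of_squares(15, 2, 2))       # [3, 5]
print(split_by_difference_of_squares(21, 8, 1, m=3))  # [3, 7]
```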
6.1 Very small moduli

Table 3 describes small circuits for Cx%M functions with coprime C and M with 6 bits or less. Each circuit is described by a parenthesized triplet consisting of the Toffoli gate count, the CNOT gate count and the number of ancillae. An expression indicating circuit structure follows after a colon. C values where gcd(C, M) > 1 and C > M are marked by × and —, respectively. For each M, the last row reports circuits constructed by binary expansion of x (Section 1.2) with the smallest gate counts among different C values.

In each case, we report the best circuit structure we could find. For example, 3x%35 can be implemented as −2^5. At most one inverter may be used on each circuit line. Of the techniques we presented, the most economical one is the use of 2x%M circuits, their repetitions, inverses and negations. In some cases (M = 21, 39, 55), it suffices for all C values. When additional circuit constructions are needed, we start with circuits for 3x%M or 5x%M, except when the modulus is divisible by 3 or 5. By means of circuit operators, these additional primitive circuits generate a large number of composite circuits, especially since they can be composed with powers of two, etc. In Table 3, the first grayed cell of a column represents a primitive circuit that is not a power of two. The smallest circuits constructed by binary expansion of x for each M (shown in the bottom row) are typically larger than the largest circuits proposed. The data suggest that divisibility of M by 3 can lead to relatively large Cx%M circuits compared to other moduli M with the same number of bits. This is because C = 3 is the smallest C value unrelated to powers of two, for which we can build compact multiplication circuits. Among M values divisible by 3, circuits for M = 39 tend to be smaller because all C values coprime with M can be obtained through positive and negative powers of two, and their inverses.

6.2 Larger moduli 

We now illustrate the use of our shortest-path reduction to find two-register mod-mult circuits. Our C++ implementation of Dijkstra's algorithm operates on an M × M vertex array, but generates edges on the fly. In one pass, it finds all single-source shortest paths starting at (1, 0) and produces Cx%M circuits for all C coprime with M (this is convenient, but not necessary when working with Shor's algorithm). The modular multiplication circuits with 7-14 bits produced by our techniques are available online at http://www.eecs.umich.edu/~imarkov/MME/. In Table 4, we show circuits for Cx%65 with all coprime C. Figure 11 shows circuit-cost distributions (for T-gate counts) for several M values in terms of cumulative distribution functions (CDF). Maximum and average costs for all 6-14 bit semiprime M values not divisible by 2 and 3 are reported in Table 5. On a fast Linux workstation (3.0GHz Intel CPU with 8GB RAM), processing one 14-bit M value takes one to six minutes, and less than three days for all 14-bit M values.^ Many 15-bit values require over 8GB memory, and runtime increases four- to eight-fold. Our implementation of Dijkstra's algorithm based on an explicit 2^n × 2^n vertex array does not scale beyond 15-bit M. However, the shortest-path formalism can be applied in different ways to find optimal circuits for larger M values, and also to perform heuristic optimization for much larger M values.

The sizes of n-bit modular multiplication circuits in Table 5 fit very well (R² > 0.999) to quadratic functions, producing the worst-case bound 6n² + O(n) and the average-case estimate 3.3n² + O(n). Thus, our circuits are 4 times smaller on average than CSD-based circuits produced using Bennett's construction discussed after Theorem 5.1.

7 Circuits for modular exponentiation 

When implementing f(y) = b^y %M, one deals with modular multiplications C_k x%M, C_k = b^{2^k} %M conditional on bits y_k, as outlined in Section 1.

7.1 Reordering and factoring of modular multiplications 

The order of conditional modular multiplications does not affect the result, and this becomes useful after 
some of them are factored. As we have shown earlier, sometimes C_k x%M is easiest to implement as

^We report timing for an implementation of Dijkstra's algorithm that uses a comparison-based priority queue from the C++ STL. Given that all path lengths are integers below 10000, we have also implemented an O(1)-time bin-based priority queue. Runtime improvements were significant for smaller n, but memory usage increased somewhat. Since memory is the main bottleneck for larger n, we decided to use the more compact comparison-based priority queue.
Table 3: The structure and gate count of the proposed circuits for Cx%M with coprime C and M with 6 bits or less. Each circuit is described by a parenthesized triplet consisting of the Toffoli gate count, the CNOT gate count and the number of ancillae. DR indicates direct use of division with remainder (Theorem 4.1). Circuits for M = 15 are illustrated in Figure 5. Gray cells indicate circuits that do not only use powers of two, negative powers of two, or their inverses. All ancillae are cleared after the computation. Additional optimizations are possible. For each M, the last row reports circuits, constructed by binary expansion of x, with the smallest gate counts among different C values.



c 


21 = 3 ■ 7 (12) 


33 = 3 ■ 11 (20) 


35 = 5 ■ 7 (24) 


A/ = p ■ g ( {!./ M 
39 = 3 - 13 (24) 


Z)>< 1) 

51 = 3 ■ 17 (32) 


55 = 5 ■ 11 (40) 


57 = 3 ■ 19 (36) 


2 


(15,16,4): 2^ 


(22,24,5): 2^ 


(17,20,4): 2^ 


(12,19,3): 2^ 


(17,19,4): 2-'- 


(12,18,3): 2^ 


(22,19,5): 2^ 


3 


X 


X 


(65,39,6): DR 


X 


X 


(85,127,3): -2-'^ 


X 


4 


(30,32,4): 2^ 


(44,48,5): 2^ 


(34.40.4): 2^ 


(24,38,3); 2^ 


(34,38,4): 2^ 


(24,36,3): 2^ 


(44,38,5); 2^ 


5 


(33,34,4): -2~^ 


(123,51,7): DR 


X 


(36,57,3): 2~^ 


(146,62,7): DR 




(138,76,8): DR 


6 


X 




(82,59,6): 2 ■ 3 


X 




(73,109,3): -2~^ 




7 


X 


(145,75,7): 2/5 




(61,98,3): -2^ 


(197,119,7): 5/8 


(36,54,3): 2"'^ 


(71,60,5): -2~^ 


8 


(45,48,4): 2'^ 


(49,51,5): -2~^ 


(51,60,4): 2-^ 


(36,57,3): 2^ 


(51,57,4): 2-^ 


(36,54,3): 2^ 


(66,57,5); 2^ 


9 


X 


X 


(34,40,4): 2~'^ 


X 


X 


(72,108,3): 2"^ 


X 


10 


(18,18,4): -2~^ 


(145,75,7): 2 - 5 


X 


(24,38,3): 2"^ 


(149,64,7): -5~^ 


X 


(160,95,8): 2 ■ 5 


11 


(15,16,4): 2~^ 




(68,80,4): 2""^ 


(60,95,3): 2~^ 


(180,100,7): 4/5 




(165,98,8): -2/5 








(65 ,39,6) : 3 — "'" 






(61,91,3): -2~~^ 






(48 50,4)' -2'^ 


(128,54,7): -5 — 


(54 62 4) ■ -2 ~ 




(34 ,38 4) ■ 2 ~ 


(108, 162, 3)- 2~''* 


(187,117,8); -5/4 






{_J.DU,fO,(j; -O/ji 




^fo,XJ./,iSy: - z 


{ 1 si(\ inn '7\- K / A 
l^ioU , lUU , / J : O/t 


f A O — 

(i'i, oD, o y : 


i AC\ A 1 \ • O 

(^■iU , -i-L , o J : -z 


15 


X 




^ 


X 




X 




16 


(30,32,4): 2~ ^ 


(27,27,5): -2~ ^ 


(68,80,4): 2** 


(48,76,3): 2"* 


(68,76,4): 2^ 


(48,72,3): 2^ 


(88,76,5): 2^ 


17 


(33,34,4): -2"^ 


(22 ,24,5): 2~^ 


(20,22,4): -2""'" 


(49,79,3): -2~^ 




(108,162,3): 2^^ 


(165,98,8): ~^ 








/'IT on A \ ' o — 1 






/'Q'l TOR Q'^. O** 
(041: , liD ,0 ) . -d. 


^ 




f T a T a A \ • ol 


/ t A '7K '7\ • 
t^l^lD , 1 , 1 ) . O / Z 


< ^7 \ A \ • o4 




rg 4^ ^^^T^ 


^rJ/.l^O.ijj. -.i 




— 


(3,2,1). -1 


/10^ t^T '7\ ' K — 1 

t^±^o,0-L)fj: o 


^- 


/"TO 1Q O — ■! 

^±z,iy,oj: .i 






/1QO 11/1 Q^. /I ^1 


21 








X 




(121,181,3); -2""-'-'^ 




22 






(51.60,4): 2 


(48.76,3): 2^"^ 


(197,119,7): 8/5 


= 


{1ST, 117, 8): -4/5 







^rr- 

(150,78,7): 


(68 ,41,6): -3 "'" 


(49 , 79 , 3) : -2"^ 


(166,83,7): — 5/2 


(61,91,3): -2 


(138,76,8): 5 ^ ^ 


24 






(71,82,4): -2~^ 


X 




(49,73,3): -2"^ 




25 




(44,48,5): 2~~ 




(72,114,3); 2^ 


(20,21,4): -2~^ 


X 


(88,76,5): 2^^ 


26 




(150,78,7): -2/5 


(37,42,4): -2~'^ 




(17,19,4): 2~ ^ 


(96,144,3): 2^"" 


(165,98,8): -5/2 | 


27 






(54,62,4); -2^ 






(13,19,3): -2^-^ 




28 




(128,54,7): -5 




(61,98,3): -2-^^ 


(163,81,7): 5/2 


(12,18,3): 2"^ 


(27,22,5): -2^-^ 


29 




(49,51,5): -2^ 


(85,61,6): -6 


(25,41,3): -2""^ 


(200,121,7): -8/5 


(97,145,3): -2~" 


(22,19,5): 2^-^ 


30 








X 




X 




31 




(27,27,5): -2^ 


(37,42.4): -2^ 


(37,60,3): -2-^ 


(163,81,7): 2/5 


(48,72,3): 2~'^ 


(160,95,8): 5/2 


32 




(5,4,2): -1 


(68,41,4): -3 


(60,95,3): 2^^ 


(51,57,4); 2-*^ 


(60,90,3): 2^^ 


(9,3,79,5): -2^* 


33 






(20,22,4): -2^ 


X 








34 






(3,2,1); -1 


(37,60,3): -2~^ 




(120,180,3): 2~-'-" 


(143,79,8): -5~1 


35 








(25,41,3): -2^ 


(71,78,4); -2* 




(182,114,8): 4/5 


36 








X 




(96,144,3); 2^ 




37 








(13,22,3): -2^ 


(183,102,7): -5/4 


(85,127,3); -2' 


(187,117,8): -4 ■ 5 


38 








(1,3,1): -1 


(37.40,4); -2~^ 


(109,163,3); -2'-^ 




39 












(49,73,3); -2'* 




40 










(183,102,7): -4/5 




(160,95,8):^!^ 


41 










(146,62,7): 5~^ 


(25,37,3): -2*"^ 


(93,79,5): -2* 


42 












(109.163,3); -2"'-^ 




43 










(54,59,4); -2'^ 


(60,90,3): 2"^^ 


(44,38,5): 2^^ 


44 










(200,121,7): -5/8 




(182,114,8): 5/4 ^ 


45 
















46 










(149,64,7): -5 J 


(73,109,3); -2*^ 


(160,95,8): 2/5 


47 










(37,40,4); -2^ 


(37,55,3); -2^ 


(165,98,8): -2 ■ 5 


48 












(37,55,3): -2^"^ 




49 










(20,21,4); -2^ 


(72,108,3): -2~^ 


(71.60,5): -2-^ 


50 










(3,2,1); -1 


X 


(66,57,5): 2^-^ 


51 












(25,37,3): -2^ 




52 












(84,126,3): 2~ ' 


(143,79,8): -5 H 


53 












(13,19,3); -2^ 


(49,41,5): -2^ 


54 












(1,1,0); -1 




55 














(27,22,5): -2^ 


56 














(5,3,2): -1 



Smallest Cx%M circuits based on binary expansion of x (with cleared ancillae)

| 2 | (136,34,11) | (225,51,14) | (241,47,14) | (225,41,14) | (216,36,13) | (202,33,13) | (202,37,13) |
Table 4: Two-register circuits for Cx%65. Cost is reported as the number of T gates.

C   Cost         Circuit                   C   Cost         Circuit
2   28           d1                        33  28           h1
3   [illegible]  [illegible]               34  140          [illegible]
4   56           d1d1                      36  126          [illegible]
6   140          [illegible]               37  140          [illegible]
7   140          [illegible]               38  126          [illegible]
8   84           d1d1d1                    41  126          [illegible]
9   140          [illegible]               42  140          [illegible]
11  140          [illegible]               43  [illegible]  [illegible]
12  126          [illegible]               44  140          [illegible]
14  140          [illegible]               46  126          [illegible]
16  70           ~1h1h1                    47  126          [illegible]
17  140          c2+1+1+2+2+1+1+2d2+2      48  140          c2+2+2+1+1+2+2+1d1+2
18  126          c2+1+1+2+1+2+1+2+2+2      49  56           h1h1
19  126          c2+2+1+2+2+1+1+2+2+2      51  140          c2+1h2+2+1+2d2+1+2
21  154          c2+2+1h1+1+1+1+1+2+1+2    53  126          c2+2+1+2h1-1-2-1-2c2
22  154          c2h1h1-1-1-1-1-1-2-1c2    54  140          c2+1+2+1+2d2-1-2d2c2
23  140          c2+2+2+1h2+2+2+1+1+2      56  126          c2+1h2-2-2-1-2-1-1c2
24  126          c2+2+2+1+1+2+2+1+2+2      57  84           h1h1h1
27  126          c2+1+2+1+2d2-1-2-1c2      58  140          c2+1h2+2+1+2h1+1+2
28  140          c2+2d2+1+2+1d1+2+2        59  140          c2h2+2+1h2-2-1-2-1c2
29  140          c2+2+2+1+2+1+2+1d2+2      61  70           ~1d1d1
31  154          c2+2+2+1+1d1+2+1d2+2      62  168          c2h2+2+1h2-2-1-2-1c2h1
32  42           ~1h1                      63  42           ~1d1
                                           64  14           ~1
Figure 12: Reordering and factoring in modular exponentiation for b = 2, M = 55. The initial circuit (left). Merging conditional −x%55 operations into one block conditional on the XOR of related controls based on Cx%M blocks from Table 3 (right). The correspondence between blocks in the two circuits can be established by matching control lines.



−(M − C_k)x%M. In this case, we can factor out −x%M conditional on y_k. Any number of conditional −x%M operations can be consolidated into one such operation, conditional on the XOR of relevant control bits. This XOR value can be computed using a chain of CNOTs without ancillae, and uncomputed by the same chain after use. Figure 12 illustrates these optimizations for b = 2 and M = 55. Modular exponentiation with base 2 and M = 55 requires conditional multiplications by 2, 4, 16, 36 = 16·16%55, 31 = 36·36%55, and 26 = 31·31%55.
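The multipliers and the consolidation of sign flips can be checked classically. In the Python sketch below, the function names and the set negated (the indices k whose multiplier is applied as M − C_k) are hypothetical choices for illustration; they do not reproduce the particular factoring drawn in Figure 12.

```python
def exponentiation_multipliers(b, M, num_bits):
    """Return [b^(2^k) % M for k in range(num_bits)] by repeated squaring."""
    multipliers, c = [], b % M
    for _ in range(num_bits):
        multipliers.append(c)
        c = c * c % M
    return multipliers


def mod_exp_with_factored_negations(b, y, M, num_bits, negated):
    """Evaluate b^y % M, applying multipliers with index in `negated` as (M - C_k)
    and merging all their sign flips into one conditional -x % M at the end."""
    x, parity = 1, 0
    for k, c in enumerate(exponentiation_multipliers(b, M, num_bits)):
        if (y >> k) & 1:
            if k in negated:
                x = (M - c) * x % M
                parity ^= 1          # XOR of the relevant control bits
            else:
                x = c * x % M
    return (M - x) % M if parity else x


print(exponentiation_multipliers(2, 55, 6))  # [2, 4, 16, 36, 31, 26]
```

The single final negation is controlled by the parity of the selected control bits, which mirrors the CNOT chain described above.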

Reordering also allows one to move a small set of the most difficult multiplications to the front of the 
circuit, where the initial value is 1 and generic multiplication circuits can be avoided, as shown below. 

7.2 k-bit look-up tables

A (k,m) look-up table (LUT) takes k read-only input bits and m > log2 k zero-initialized ancillae. For each of the 2^k input combinations, a LUT produces a pre-determined m-bit value. For example, a (2,4)-LUT may be defined by values (1,2,4,8) or (1,4,1,4).
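One way such tables can be generated is by applying a few conditional multiplications to the constant 1. The Python sketch below is illustrative; the function name lut_values and the choice M = 15 with multiplier lists [2, 4] and [4, 1] are assumptions picked here because they reproduce the two example tables.

```python
def lut_values(multipliers, M):
    """Table of (product of multipliers[j] over set bits j of y) % M for all inputs y."""
    k = len(multipliers)
    table = []
    for y in range(1 << k):
        value = 1
        for j, c in enumerate(multipliers):
            if (y >> j) & 1:
                value = value * c % M
        table.append(value)
    return table


print(lut_values([2, 4], 15))  # [1, 2, 4, 8]
print(lut_values([4, 1], 15))  # [1, 4, 1, 4]
```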

Look-up tables arise in implementations of Shor's algorithm (with initialized bits) where the first 
conditional modular multiplication is applied to the constant 1, and can therefore produce only two 
possible values — 1 and the multiplier. Such a circuit can be implemented with at most m CNOT gates 
(m/2 on average). When two conditional multiplications are considered, four output combinations are 
possible. For every bit of the result, this defines a two-input Boolean function, which can be implemented 
with at most two reversible gates (possibly with negative controls) and no ancillae. All these gates operate 
Table 5: Costs of two-register circuits over n-bit semiprime M values not divisible by 2 and 3.



^ of BGmiprimcs 
[smallest, largest] 



Circuits with i 
C M 



: costs 
circuit 



. [65, 119] 



109 
112 



107 
109 
113 
114 
116 



115 
115 
115 
115 
115 
115 
119 
119 
119 
119 
119 
119 
119 
119 
119 
119 



c2+l + l + 2 + 2d2d2d2d2+2 

c2+l + 2 + 2 + l + 2 + ld2d2d2-|-2 

c2hlhlhlhl+ 1 + 1 + 2 + 2+2 

c2h2h2h2h2+ 2 + 2+ 1 + 1 + 2 

c2+l + 2+2 + l + 2hl+l + l + l + l + 2 + 2 

c2+2 + 2+l + ldldldldl + 2 

c2+l + 2h2h2h2-2-2-l-2c2~l 

c2+ 1 + 2+1 + Id ldl-2-ld2c2"'l 

c2+ldl+ldldldl-2-2c2~l 

c2+2 + 2d21ild2-ld2-lc2~l 



c2+ 1+2+ 1+1+2+1+2+1+1+1+2+2+2 
c2+l + 2+lli2+2 + l + 2 + l + l + 2+l + 2 
c2+2 + l + 2 + l+ldldl + ldl + 2 
c2 + 2 + 2hlhlhl-lhl-lc2~l 



16 in [133, 253] 



253 
253 



c2+2 + l + 2 + l+l + l + 2 + l + 2 + l + 2 + ldl + 2+2 
c2 + 2hl + l + 2+l + 2 + l + 2+l + l+l + 2+l + 2 + 2 



34 in [259, 511] 



431 
476 



36 

459 

469 

477 

487 

491 

494 



485 
485 
505 
505 
505 
505 
505 
505 
505 
505 
505 
505 



c2+2hl- 1-1-2- 1-2-2- 1-1-2- 1-2- 1-2-2- lc2 

c2+l + 2 + 2 + l + 2 + l + 2 + l+l + 2+2 + l + 2 + l + ldl-2c2 

c2+l + 2 + 2 + l+l + 2 + 2 + l+lh2 + 2 + 2+2 + l + l + 2+2 



c2+2 + l+l + 2+2 + 2d2 + l + l + 2+2 + l + l + 2+2 + l + 2 
C2+2 + 1+1 + 2+2 + 1 



c2+2 + l + 2 + l + 2 + l + 2 + 2hl + l+l + l + l + 2+2 + l + 2 



[515, 1007] 



935 1007 c2+l + l + ldl+ldlh2h2dl-2dl-2-2c2~ 

951 1007 c2+l + ldl + lh2dlh2dl-2dl-2-2-2c2~l 

971 1007 c2+l + l + ldl+ldlh2h2dl-2h2-2-2c2~ 

979 1007 c2 + 2 + 2d2 + 2d2d2hlhl-lhl-l-l-lc2"'l 

989 1007 c2+2 + 2+21il + 2hlhllild2-ld2-l-lc2~ 

993 1007 c2+2 + 2hl + 2hlhlhld2-ld2-l-l-lc2"'l 



152 in [1027, 2047] 



292 2045 
203S 2045 



c2+l + 2 + l + l + 2 + 2 + 2 + 2d2 + l + l + 2+l + 2 + l + l+ldldl-2c2 
c2+2hlhl-l-l-l-2-l-2-l-lh2-2-2-2-2-l-l-2-lc2 



299 in [2051, 4087] 



2229 3901 
3894 3901 



c2+l + l + ldldl + ldl + 2+ld2 + l + 2d2 + ld2d2+l + 2 
c2+2hl + 2hl-l-2-lh2h2 + 2h2+2h2h2 + 2+2 + 2 + 2 



621 in 
[4097, 8189] 



6347 7405 c2hlhl + Ihlhl + lh Ihlhl-lhl- 1-1- 1-2-1- l-2c2~ 1 

7398 7405 c2 + 2 + l + 1 + 2+ 1 + 1 + ldl+ ldldldl-ldldl-ldldlc2~ 1 

1060 7421 c2hlhl+llilhlhl+llilhlhl-l-l-l-l-2-2-2-lc2~l 

7414 7421 c2 + 2 + l+l + l + 2 + 2 + 2 + 2d2d2d2-2d2d2d2-2d2d2c2-l 

5352 7493 c2+ 1 + 1+ ldldl + ldldl-ldldl-2-2-l-ldl + 2dl + 2 

7486 7493 c2+l + l + 2 + 2hl-lhlhlhl+ lhl + lhlhlhl + 1 + 1+ 1 + 2 

2163 7571 c2+l + l+ldldl + ldldldl-ldl + 2 + 2+l + 2 + ld2d2 + 2 

7564 7571 c2hl + 2hl-l-l-2-l-2h2h2h2h2 + 2h2h2+2 + 2+2 + 2 

2211 7739 c2+ 1 + l+ldldl + Id Idl + ldld 1 + 2+2 + 1 + 2 + ld2d2+2 

7732 7739 c2+2 + 2+ 1 + 2+ lhlhlhlhl + lhlhl+ lhlhl + 1 + 1 + 1+2 

2223 7781 c2+2 + 2 + 2d2d2 + 2d2 + 2d2d2-2d2d2d2-l-l-2-2c2~ 1 

7774 7781 c2+2 + 2+ l + lh2h2h2-2h2-2h2h2-2h2h2-2-2-2c2~ 1 

6839 7979 c2hlhl-lhlhlhlhl + lhl+ lhl + 1 + 1+ 1 + 2 + 1 + 1 + 2 + 2 

7972 7979 c2+2 + l+ 1 + 2+ 1 + 1+ ldl+ ldl+ ldldldldl-ldldl + 2 

6959 8119 c2hlhlhlhl + lhlhl + lhl + lhl + l + l + l + 2 + l + l + 2+2 

8112 8119 c2+2 + l+l + 2+l + l+l + ldldldl-ldl-ldldldldl+2 

4076 8159 c2+ 2 + 2+ 1 + 2+ ldl+ Idld 1 + Id 1 + Id Idl+ldld 1 + 2 + 2 

4662 8159 c2+2hlhl + lhlhl + lhl + lhlhl + lhl + 1 + 2 + 1 + 2 + 2 + 2 



1212 in 
[8197, 16379] 



2020 14141 c2+l + 2 + l + 2+l + l + ldl + 2 + l + 2d2+l + 2d2 + l+l + 2d2d2 + 2d2+l + 2 

14134 14141 c2 + 2 + 2hl + 2hlhl + lhlhlhl-lhlhld2hl + l + l+l + 2 

12143 14167 c2hlhlhlhl + lhl + lhlhlhlhl + l+l + l + 2hl+l + l + 2 

14160 14167 c2 + 1 + 1 + 2 + 2 + ldl + ldldldldl + ldl + ldldldldl + 2 

12467 14545 c2+2 + 2 + 2d2d2d2d2 + 2d2d2d2d2-2hl-ld2-l-lc2~ 1 

14538 14545 c2+ l + lh2 + lh2 + 2h2h2h2h2-2h2h2dlh2-2-2-2c2~ 1 

12503 14587 c2+2 + 2+ 1 + 2+ 1 + 1+ l + ldldldl + 2+ l + 2d2 + l + 2d2+ l + 2+l + ldl + 2 

14580 14587 c2hl + l+ 1 + 2+ lh2+ 2 + lh2 + 2+ l + 2hlhlhl + 1 + 1 + 1+ 1 + 2 + 1 + 2 + 2 + 2 

12755 14881 c2+ 1 + 2 + 1 + 2+2 + 2 + 2 + 2+2d2d2d2d2 + 1 + 2 + Id 1-2-2- l-l-2dl-2c2 

14874 14881 c2+2 + 2hl-l-2-l-2-l-l-2h2h2h2h2h2-2-2-2-2-2-2-l-2-lc2 

12863 15007 c2+ 1 + 2 + 1 + 2+ 1 + 2 + 2d2d2d2d2- 1-2- 1-2- 1-2- 1-2- ldl-2dlc2"' 1 

15000 15007 c2+2 + 2hl + 2hlhlhlhlhl-lhl-lhld2hl-l-l-lc2~l 

13151 15343 c2+2 + l + 2 + l + 2 + l + ldldldldldl + 2 + l + 2 + l + 2 + 2 + 2 + l+ldl + 2 + 2 

15336 15343 c2+2hl+ 1 + 1 + 2 + 2+ 2 + 1 + 2 + 1 + 21ilhlhlhlhl+ 1 + 1 + 2 + 1 + 2+ 1 + 2 + 2 

13259 15469 c2+ 1 + 2+ 1 + 2+ 1 + 1+ 1 + 2+1 + 2d2- 1-1-2- Idl- Id ldl-2dl-2-ldlc2 

15462 15469 c2hl + l + 2hl + 2hlhl + lhl + l + 2 + l + lh2-2- 1-2- 1-1- 1-2-1-2- lc2 

2236 15653 c2+ 1 + 2+ 1 + 2+ 1 + 2 + 2d2d2 + l+ 1 + 1 + 2 + Id ldl-2- 1-2- l-2-2d2d2c2 

15646 15653 c2+ 1 + 2+2 + 1+ 1 + 2+2 + 1+ 1 + 2+ 2 + 2d2d2d2 + 1 + 2 + 1 + 2 + 2 + ldldldlc2 

13463 15707 c2+ 1 + 2+ 1 + 2+ 1 + 2+ 2d2d2d2+ 1 + 2+ ldl + 2 + 2+2 + 1 + ldl-2- Id l-2c2 

15700 15707 c2+lh2 + 2 + lh2-2-2-l-l-lh2-2-l-2hlhlhl- 1-1-2- 1-2- l-2c2 

15756 15763 c2+ 2hlhl + 1 + 1 + 1 + 2 + 2+1 + 2 + 2hlhlhlhl + 1 + 1 + 1 + 1 + 2 + 2+ 1 + 2 + 2 

2254 15779 c2hlhl+ lhlhlhlhlhl + lhlhl + lhl + 1 + 2 + 1+2 + 2+2 

15772 15779 c2+2 + 2 + l + 2+ldl+ ldldl + ldldldldldl + ldldl + 2 

2260 15821 c2+2 + l + 2 + 2+l + l+l + ldldldldl + 2 + l + 2 + l + l + 2+2 + 2d2d2+l + 2 

15814 15821 c2+lh2h2 + 2+2 + 2+l + l + 2 + l + 2hlhlhlhl + l+l + l+l + 2 + 2+l + 2 + 2 

14079 15839 "Itltl 

15830 15839 "Irlrl 

2272 15905 c2+l + l+ ldl+ Idld Idl- Id Idld ldlh2 + 2d 1 + 2 + 2 + 2 

15S9S 15905 c2+2 + 2hl + 2hlhlhlhlhl-lhld2hl + lhl+l + l+l + 2 

13655 15931 c2+ 1 + 2+ l + 2+2 + 2 + 2 + 2+2d2 + 2d2d2d2d2 + l+ 1 + 2+ 1 + 2+ ldl-2-2c2 

15924 15931 c2+2 + l + 2hl-l-l-2-l-l-lh2h2h2h2-2h2-2-2-2-2-l-2-l-lc2 

2280 15961 c2h2h2h2h2h2-2h2h2-2h2h2-2h2-2-l-2-l-lc2^1 

15954 15961 2 + 2+2 + l + 2 + ldl + ldldl + ldldl + ldldldldldlc2~ 1 

7 15989 c2+2 + 2 + l + 2+ldl+ldldl + ldldl + ldldldl+ldldlc2 

2284 15989 c2+ 1 + 2+ 1 + 1+2 + 2+ 2 + 2d2d2+ l + l + 2d2+ 1 + 1+ 1 + 1 + 2 + Idld ldl-2c2 

13705 15989 c2+ 1 + 1 + 2 + 2+2 + l + 2 + 2d2d2d2d2- 1-1-2- 1-2- 1-2- l-2d2-ld2c2 

15982 15989 c2+ lh2h2-2-2-2-l-2-l-2-2-lhlhlhlhl-l-l-l-l-2-2-l-2c2 

4584 16045 c2+2 + 2+ l + 2+l + 2hlhlhlhl+ lhlhl + l + 2 + l + l + 2 + l + l + 2 + 2+ 1 + 2 

4652 16283 c2+2 + 2hllil + 2hl-l-l-2-lh2h2-2-2-2-2-l-l-2-lh2-2-2-lc2 

8138 16283 c2+2 + 2 + 2 + 2+ 1 + 2+ l + l + 2 + 2 + 2 + 2d2d2+ 1 + 2+ l + ldl-2dldl-2dlc2 

14015 16351 c2+2 + 2 + 2d2+2d2d2d2+2d2d2d2d2hl-ld2-l-lc2~l 

16344 16351 c2+ l + lh2 + lh2h2h2h2h2-2h2h2h2-2dl-2-2-2c2"' 1 



Trend lines: $\mathrm{Max}(n) = 5.309n^2 - 11.59n + 4.5$; $\mathrm{Avg}(n) = 3.351n^2 + 7.127n - 78.\ldots$

Extrapolated values (Max, Avg) for n = 20: (1896, 1404); 50: (12697, 8655); 100: (51935, 34144); 200: (210046, 135386); 300: (474337, 303649)
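The extrapolated maxima can be reproduced from the quadratic trend line. The following is a minimal sketch that assumes the Max trend line has the form $5.309n^2 - 11.59n + 4.5$ as read from the garbled text; the function name `max_cost` is illustrative only.

```python
# Evaluate the (assumed) Max trend line and compare with the listed extrapolations.
def max_cost(n):
    return 5.309 * n**2 - 11.59 * n + 4.5

# First component of each extrapolated pair, keyed by n.
listed_max = {20: 1896, 50: 12697, 100: 51935, 200: 210046, 300: 474337}

for n, listed in listed_max.items():
    print(n, round(max_cost(n), 1), listed)
```

Each computed value agrees with the listed one to within one unit, which is consistent with the coefficients having been rounded for display.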
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(a) (b) 

Figure 13: Implementation of conditional modular multiplications by 16, 36 and 31 in modular exponentiation for $b = 2$, $M = 55$ as a (3,6)-LUT. (a) A straightforward implementation; (b) an optimized circuit. The gates $g_{11}, g_{12}$ and $g_{13}, g_{14}$ use $y_4$ (i.e., $x_1 x_3' \oplus x_1' x_3$) as a control. Applying the last two CNOT gates in (b) is equivalent to applying $g_{15}$ and $g_{16}$ in (a). In (b), a gate $g$ is added to clear $y_5$, which is used to simplify $g_7, g_8$.



in parallel, although most existing technologies cannot exploit this much parallelism. When two output bits implement the same function using Toffoli gates, one of them can be replaced by a CNOT that copies the computed value.

Reconsider the conditional mod-mults by 16, 36 and 31 required in modular exponentiation for $b = 2$, $M = 55$, as shown in Figure 12. Depending on the three input bits, the output may be 1, 16, 36, 31, $26 = 16 \cdot 36 \% 55$, $1 = 16 \cdot 31 \% 55$, $16 = 31 \cdot 36 \% 55$, or $36 = 16 \cdot 31 \cdot 36 \% 55$. Figure 13(a) illustrates a simple realization based on the following Boolean expressions for output variables, where $\oplus\{i, j\}$ denotes $m_i \oplus m_j$ for minterms $m_i$ and $m_j$.

$y_0 = \oplus\{0, 3, 5\}$, $y_1 = \oplus\{3, 4\}$, $y_2 = \oplus\{2, 3, 7\}$, $y_3 = \oplus\{3, 4\}$, $y_4 = \oplus\{1, 3, 4, 6\}$, $y_5 = \oplus\{2, 7\}$
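The minterm sets above can be recomputed from the eight possible outputs. The sketch below assumes that minterm label $i$ refers to the $i$-th output in the order listed in the text and that $y_0$ is the least significant output bit; both are assumptions, and the helper name `minterm_sets` is illustrative.

```python
# Outputs of the (3,6)-LUT for b = 2, M = 55, in the order given in the text.
# Assumption: minterm label i refers to the i-th entry of this list.
M = 55
outputs = [1, 16, 36, 31,
           16 * 36 % M, 16 * 31 % M, 31 * 36 % M, 16 * 31 * 36 % M]

def minterm_sets(values, n_bits):
    """For each output bit y_k, collect the labels whose output has bit k set."""
    return [{i for i, v in enumerate(values) if (v >> k) & 1}
            for k in range(n_bits)]

sets = minterm_sets(outputs, 6)
for k, labels in enumerate(sets):
    print(f"y{k} = XOR{sorted(labels)}")
```

Under these assumptions, $y_1$ and $y_3$ receive identical sets, which is the kind of repetition that lets one output be copied from another with a CNOT.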

Since some Boolean functions requiring more than two gates repeat, they can be computed once and then copied. Some Boolean functions can also be used to compute other Boolean functions. Following these optimizations, an improved version of the circuit in Figure 13(a) is shown in Figure 13(b); it is smaller than the three conditional modular multiplications by 16, 36 and 31 as reported in Table 3. Figure 14 illustrates the LUT implementation of modular exponentiation for $M = 21$ with different coprime base values. Predictably, the cases with $b^2 \% M = 1$ result in the most compact circuits.

Systematic synthesis. We now construct circuits to implement each output of a reversible LUT. Viewing each output as a Boolean function of read-only inputs, one can write the Shannon decomposition $F = xF_x \oplus x'F_{x'}$, where $F_x$ and $F_{x'}$ are the positive and negative cofactors of $F$. This equation can be rewritten as Formula (10), the positive Davio decomposition, or as Formula (11), the negative Davio decomposition.

$F = F_{x'} \oplus x(F_x \oplus F_{x'})$ (10)
$F = F_x \oplus x'(F_x \oplus F_{x'})$ (11)
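As a sanity check, the Shannon and both Davio decompositions can be verified exhaustively for small functions. The following is a minimal sketch over all 256 three-input Boolean functions; the truth-table encoding and the names `evaluate` and `davio_holds` are assumptions made for illustration.

```python
from itertools import product

def evaluate(table, bits):
    """Look up a truth table stored as an int: bit i holds F at input index i."""
    index = sum(b << k for k, b in enumerate(bits))
    return (table >> index) & 1

def davio_holds(table, n=3, pivot=0):
    """Check Shannon, positive Davio and negative Davio at every input point."""
    for bits in product((0, 1), repeat=n):
        x = bits[pivot]
        lo = list(bits); lo[pivot] = 0      # input point of the negative cofactor
        hi = list(bits); hi[pivot] = 1      # input point of the positive cofactor
        f = evaluate(table, bits)
        f_neg, f_pos = evaluate(table, lo), evaluate(table, hi)
        shannon = (x & f_pos) ^ ((1 - x) & f_neg)
        positive = f_neg ^ (x & (f_pos ^ f_neg))
        negative = f_pos ^ ((1 - x) & (f_pos ^ f_neg))
        if not (f == shannon == positive == negative):
            return False
    return True

all_ok = all(davio_holds(t, pivot=p) for t in range(256) for p in range(3))
print(all_ok)
```

The Davio forms matter for reversible synthesis because each term maps to an uncontrolled sub-circuit or a sub-circuit with one extra control on the pivot variable.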

Table 6 shows that each 2-input Boolean function can be implemented by a reversible circuit with read-only inputs using at most three gates, of which at most one is a Toffoli gate. To implement a three-input function, cofactor it with respect to one of its inputs. Implement the first cofactor without controls and then implement a controlled version of the XOR of the two cofactors. This approach leads to at most one 4-input Toffoli gate and at most 6 smaller gates. Circuit costs can be minimized by choosing the cofactoring variable (pivot) so as to minimize the total cost of the cofactors based on Table 6. Working with four-input functions, one can implement four modular-multiplication modules by one (4,n)-LUT by implementing cofactors as three-input functions. However, using two separate cofactoring steps may require five-input Toffoli gates. An alternative approach is to consider the four double-cofactors (each a two-input function) with respect to two variables, as shown in Formula (12), and introduce an ancilla to enable the fourth cofactor. This ancilla will be set and unset by a Toffoli gate and will enable the
For a Boolean function of n variables, a minterm is a product term of all n variables (either complemented or uncomplemented). Each minterm can be labeled by an integer by interpreting negated literals as bits in the label. For
example, expanding minterms for j/o leads to x'-^^x^x'^ ® x'^X2X3 ffi x'^X2x'^. 

In general, for a semiprime $M$, there are four values $b$ such that $b^2 \% M = 1$, two of them being $b = \pm 1$. The other two are as difficult to find as factoring $M$.

Figure 14: Modular exponentiation with $M = 21$ and all coprime base values implemented as (3, 5)-LUTs. For each circuit, the parenthesized label includes the period of modular exponentiation (boldfaced) and the multipliers of conditional multiplications. The four $b$ values where $b^2 \% M = 1$ ($b = 1, 8, 13, 20$) lead to particularly compact circuits, but finding such values ($\neq \pm 1$) for large $M$ is as hard as factoring $M$. Coprime values 4, 5, 20, 16 and 17 trigger restarts in Shor's algorithm and are given only to illustrate the circuits.



Table 6: Circuits for all 16 two-input functions. N, C, and T denote NOT, CNOT and Toffoli gates. Variables a and b are inputs and z is the output. No ancillae are used.

Function   Minterms   Circuit
0000       -          -
0001       0          T(a',b',z)
0100       1          T(a',b,z)
0010       2          T(a,b',z)
1000       3          T(a,b,z)
0101       0,1        C(a',z)
0011       0,2        C(b',z)
0110       1,2        C(a,z), C(b,z)
1001       0,3        C(a,z), C(b,z), N(z)
1100       1,3        C(b,z)
1010       2,3        C(a,z)
0111       0,1,2      T(a,b,z), N(z)
1101       0,1,3      T(a,b',z), N(z)
1011       0,2,3      T(a',b,z), N(z)
1110       1,2,3      T(a',b',z), N(z)
1111       0,1,2,3    N(z)
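The circuits in Table 6 can be checked by direct simulation. The sketch below assumes that minterm label $2a + b$ corresponds to the input pair $(a, b)$, that a primed control fires when its input is 0, and that the output $z$ starts at 0; the gate encoding and the names `control`, `run` and `TABLE_6` are illustrative choices, not taken from the source.

```python
def control(name, a, b):
    """Value of a control literal such as "a", "a'", "b" or "b'"."""
    value = a if name[0] == "a" else b
    return value ^ 1 if name.endswith("'") else value

def run(circuit, a, b):
    """Apply NOT/CNOT/Toffoli gates to target z (initially 0) and return z."""
    z = 0
    for gate in circuit:
        controls = gate[1:]                 # ("N",) has no controls and always fires
        if all(control(c, a, b) for c in controls):
            z ^= 1
    return z

# Minterm set -> circuit, transcribed from Table 6.
TABLE_6 = {
    (): [],
    (0,): [("T", "a'", "b'")], (1,): [("T", "a'", "b")],
    (2,): [("T", "a", "b'")], (3,): [("T", "a", "b")],
    (0, 1): [("C", "a'")], (0, 2): [("C", "b'")],
    (1, 2): [("C", "a"), ("C", "b")],
    (0, 3): [("C", "a"), ("C", "b"), ("N",)],
    (1, 3): [("C", "b")], (2, 3): [("C", "a")],
    (0, 1, 2): [("T", "a", "b"), ("N",)],
    (0, 1, 3): [("T", "a", "b'"), ("N",)],
    (0, 2, 3): [("T", "a'", "b"), ("N",)],
    (1, 2, 3): [("T", "a'", "b'"), ("N",)],
    (0, 1, 2, 3): [("N",)],
}

def realized_minterms(circuit):
    return tuple(2 * a + b for a in (0, 1) for b in (0, 1) if run(circuit, a, b))

mismatches = [m for m, c in TABLE_6.items() if realized_minterms(c) != m]
print(mismatches)
```

Under the stated labeling, every circuit realizes its minterm set, uses at most three gates, and contains at most one Toffoli gate.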



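cofactor using a single control. One of the following formulas can be selected based on the costs of double-cofactors obtained from Table 6.

$F = F_{x'y'} \oplus x(F_{xy'} \oplus F_{x'y'}) \oplus y(F_{x'y} \oplus F_{x'y'}) \oplus xy(F_{x'y'} \oplus F_{x'y} \oplus F_{xy'} \oplus F_{xy})$

$F = F_{x'y} \oplus x(F_{xy} \oplus F_{x'y}) \oplus y'(F_{x'y} \oplus F_{x'y'}) \oplus xy'(F_{x'y'} \oplus F_{x'y} \oplus F_{xy'} \oplus F_{xy})$

$F = F_{xy'} \oplus x'(F_{xy'} \oplus F_{x'y'}) \oplus y(F_{xy} \oplus F_{xy'}) \oplus x'y(F_{x'y'} \oplus F_{x'y} \oplus F_{xy'} \oplus F_{xy})$

$F = F_{xy} \oplus x'(F_{xy} \oplus F_{x'y}) \oplus y'(F_{xy} \oplus F_{xy'}) \oplus x'y'(F_{x'y'} \oplus F_{x'y} \oplus F_{xy'} \oplus F_{xy})$ (12)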

Of the four cofactors in each formula, one can be implemented without control, two with a single control without ancillae, and one with a single control with an ancilla. This approach leads to at most 12 gates, of which at most three are four-input Toffoli gates. The depth and gate count of a (4, n)-LUT are $O(n)$. Figure 15 illustrates the result of applying the systematic synthesis to $2^x \% 87$. Selecting the cofactoring variable carefully, implementing the appropriate cofactor without control, and sharing cofactors among different functions can reduce the number of gates. Davio decompositions were used in [35] to synthesize a given reversible function. However, the technique in [35] implements the Davio decompositions by assuming that the factors have already been computed on dedicated ancillae. Therefore, the resulting circuits require numerous ancillae. The work in [25] does not clear these ancillae. Our (4, n)-LUT circuits use at most one ancilla, and we clear it.

Figure 15: Implementation of conditional modular multiplications by 4, 16, 82, and 25 in modular exponentiation for $b = 2$, $M = 87$ as a (4, 7)-LUT. Each sub-circuit is the result of applying the systematic synthesis approach for one output. Each computation uses one ancilla and clears it. Further optimization is possible.
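The four double-cofactor decompositions discussed above share one pattern: pick a base cofactor, add two singly-controlled differences, and add the XOR of all four cofactors under a double control. The following sketch checks that pattern pointwise; the parameterization by a base pair `(bx, by)` and the name `decompose` are assumptions introduced for illustration.

```python
from itertools import product

def decompose(base, x, y, c):
    """Evaluate one double-cofactor decomposition at the point (x, y).

    base = (bx, by) selects the cofactor implemented without control;
    c[i, j] is the double-cofactor of F for x = i, y = j.
    """
    bx, by = base
    lx = x ^ bx                     # literal x if bx == 0, literal x' if bx == 1
    ly = y ^ by                     # literal y if by == 0, literal y' if by == 1
    all_four = c[0, 0] ^ c[0, 1] ^ c[1, 0] ^ c[1, 1]
    return (c[bx, by]
            ^ (lx & (c[1 - bx, by] ^ c[bx, by]))
            ^ (ly & (c[bx, 1 - by] ^ c[bx, by]))
            ^ (lx & ly & all_four))

points = list(product((0, 1), repeat=2))
all_valid = True
for values in product((0, 1), repeat=4):        # every assignment of cofactor values
    c = dict(zip(points, values))
    for base in points:                          # the four formulas
        for x, y in points:
            if decompose(base, x, y, c) != c[x, y]:
                all_valid = False
print(all_valid)
```

Because the check holds for arbitrary cofactor values, it also holds when each cofactor is itself a two-input function of the remaining variables.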



7.3 Control optimization using 2-to-2 multiplexors 

A large fraction of quantum-gate costs can be attributed to controls (read-only bits), and this is particularly true for mod-exp circuits, where each $C_i x \% M$ block is enabled (controlled) by one bit of Register 1. To avoid propagating these controls to each gate of the $C_i x \% M$ block, we observe that the binary 000...0 is a fixed point of every such block. Control can be implemented indirectly by conditionally swapping a constant zero into the register before the block and swapping the result out after the block (Figure 16). For $n$ qubits, this technique requires an additional $n$-qubit zero-initialized swap register and $2n$ Fredkin (controlled-SWAP) gates. We merge pairs of adjacent Fredkin gates with controls from Register 1 and common target bits in Register 2. Indeed, Register 2 must be swapped with the swap register only when the two control bits carry mutually exclusive (that is, different) values. Therefore, we first apply a CNOT gate to the two controls from Register 1, then (optimized) Fredkin gates (for each qubit of Register 2) controlled by the target bit of the CNOT, and then we repeat the same CNOT gate to restore the modified control bit. This is illustrated in Figure 17. Each Fredkin gate can be broken down into a single-controlled Toffoli surrounded by two CNOT gates. However, when one of the swapped inputs always carries a zero, the first CNOT gate can be removed. Given that $C_i x \% M$ blocks in the literature contain $O(n^2)$ gates, our two optimizations bring substantial savings and simplify the structure of mod-exp circuits.
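The multiplexing idea can be illustrated with a purely classical simulation on basis states. The sketch below assumes that a control value of 0 parks the data in the swap register (so the uncontrolled block sees its fixed point 0) and that adjacent swaps are merged so that a swap happens only when consecutive control bits differ; the function name `mod_exp_with_mux` and the example multipliers for $b = 2$, $M = 55$ are illustrative.

```python
from itertools import product

def mod_exp_with_mux(controls, multipliers, M, x=1):
    """Apply uncontrolled blocks C*x % M, steering data with conditional swaps."""
    results, swap = x, 0            # results register and zero-initialized swap register
    in_results = 1                  # 1 while the data sits in the results register
    for c, C in zip(controls, multipliers):
        if c != in_results:         # merged multiplexors: swap only if the bits differ
            results, swap = swap, results
            in_results = c
        results = C * results % M   # uncontrolled block; 0 is mapped to 0
    if not in_results:              # final swap returns the data to the results register
        results, swap = swap, results
    return results, swap

M = 55
multipliers = [2, 4, 16, 36, 31]    # 2^(2^i) % 55 for i = 0..4
print(mod_exp_with_mux((1, 0, 1, 0, 0), multipliers, M))
```

For every control pattern the results register ends with the product of the selected multipliers modulo $M$, and the swap register returns to zero, using one conditional swap per boundary between blocks instead of two per block.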

Ancillae sharing. Our proposed optimizations trade off the overhead of control logic for a number of additional ancillae. In addition to the control register (where the Hadamard gates are applied) and the results register (Figure 10), multiplexing requires a swap register of size $n$. This is separate from the ancillae required by our mod-mult circuits shown in Table 1. Fortunately, many ancillae already used by the $Cx \% M$ circuits can be reused for multiplexing under some conditions. To this end, our multiplexing construction guarantees that either the results register or the swap register is holding all zeros. In the latter case, the swap register bits can clearly be used as zero-initialized ancillae in mod-mult circuits, as long as we restore them before the next multiplexing, which we do (at least for $x < C$, as discussed for additive and multiplicative circuit blocks in Sections [2] and [ ]). In the former case, we need to make sure that when the $Cx \% M$ computation is performed with $x = 0$, the ancillae are restored to their (possibly non-zero) initial values. Consider the modular reduction in Section 2 with one comparator and one conditional subtraction, where the comparison and conditional subtraction are performed on the value stored in ancillae. Consider the zero-initialized ancilla that carries the condition bit. For any value $A$ in the ancillae, $x = 0 < A$ and the condition bit equals 0. Hence, the conditional subtraction is not applied. Since the Cuccaro adder recovers the values in the second register (and changes the first register to the result), the possibly non-zero initial values in the ancillae will be recovered. Therefore, we need to add one ancilla to save $n - 1$ ancillae. The added ancilla will be cleared as before. Now consider the following individual blocks used in our circuits.

• $-x \% M$. Our construction maps $x = 0$ into $M$.

• $2^k x \% M$ for odd $M > 2$. This block contains one modular reduction but has a fixed point at $x = 0$.

Relevant optimizations in [22, Section III.C] and [5, Section IV.D] are costlier than ours.

Figure 16: Control optimization for modular exponentiation. Conditional multiplications by $C_i$ modulo $M$ are replaced by multiplexing that conditionally swaps constant zeros into the input of multiplication and swaps the resulting bits out. Pairs of adjacent 2-to-2 multiplexors are optimized further in Figure 17.

Figure 17: Merging neighboring 2-to-2 multiplexors (left); implementing a 2-to-2 multiplexor with three-input controlled-SWAP/Fredkin gates (middle); optimizing Fredkin gates (right).



• $2^k x \% M$ for $M = 2^{k'} \pm 1$. Our construction contains several modular reductions and additions based on the Cuccaro adder. $x = 0$ is a fixed point for the Cuccaro adder. However, a non-zero value in the ancillae changes $x = 0$ after addition.

• Division-with-remainder circuits. Our construction includes a set of modular reductions followed by a circuit for $(2^{k'} + 1)x$ (not modular). Assigning $x = 0$ in Formula [H] reveals that the $(2^{k'} + 1)x$ circuit does not change $x = 0$ as long as the ancillae used for Ci carry zero. Next, we use a set of Cuccaro-based modular reductions and additions. Overall, $x = 0$ is a fixed point for division-with-remainder circuits with zero-initialized ancillae.

This analysis indicates that for $2^k x \% M$ with odd $M > 2$, ancillae can be shared. However, the $-x \% M$ block complicates the proposed sharing of ancillae with 2-to-2 multiplexors. Therefore, we factor out such blocks, aggregate them into one as described in Section 7.1, and implement one conditional $-x \% M$ directly as described in Section 3. Without multiplexing, the swap register must hold all zeros and can thus hold the ancillae of the $-x \% M$ block. For the other cases, $2^k x \% M$ with $M = 2^{k'} \pm 1$ and division-with-remainder circuits, ancilla sharing cannot be applied and separate ancillae are needed for mod-mult and multiplexer modules.

7.4 Circuit structure for modular exponentiation 

Overall structure. Summarizing the content of the above subsections, we propose mod-exp circuits consisting of three modules: (i) an initial LUT, (ii) an XOR-controlled negation, and (iii) the remaining conditional modular multiplications. The first two modules use a linear number of gates, while the bulk of the circuit is in modular multiplications. To simplify the implementation of control, we use uncontrolled modular multiplications with multiplexors. The size of mod-mult circuits is moderated by factoring out negations and by implementing the most difficult multiplications in the LUT module. For an $n$-bit modulus $M$, our circuits use an $n$-qubit results register, a $2n$-qubit control register, an $n$-qubit swap register and an $n$-qubit ancilla register for each modular multiplication. In addition to these $5n$ qubits, fewer than $n$ ancillae may be needed for arithmetic operations (such as doubling and tripling), but these ancillae can also be shared with the swap register. In total, $5n$ to $6n$ qubits are used.

Base selection. Recall that in Shor's algorithm not all values $b > 1$ for the base of exponentiation succeed: for some values the period $\pi$ of $f(x) = b^x \% M$ is even and $b^{\pi/2} \% M \neq -1$, while for others it is either odd or $b^{\pi/2} \% M = -1$. It is proven [15] that the successful case occurs with probability at least 50% (see also Table 7). Therefore, common descriptions of Shor's algorithm make a random choice of $1 < b < M$, invoke period-finding, and repeat the entire process for another $b$ if the period is either odd or $b^{\pi/2} \% M = -1$. Obviously, when $\gcd(b, M) > 1$, there is no need for quantum circuits, but this occurs increasingly rarely for large semiprime $M$. Therefore, when illustrating mod-exp circuits in our work, we observe $\gcd(b, M) = 1$. The set of reasonable $b$ values can be further restricted as follows.

Theorem 7.1 Define admissible $b$ values as those satisfying $1 < b < M$ and $\gcd(b, M) = 1$. Consider the function $f_b(x) = b^x \% M$ with an admissible $b$ value.

• For an integer $k$, $\mathrm{Period}[f_b] = \gcd(\mathrm{Period}[f_b], k) \cdot \mathrm{Period}[f_{b^k}]$. In particular, if $b$ results in an even period, so does $b^{2k+1}$ for $k > 0$.

• For two admissible $b$ values $b_0$ and $b_1$, if $b_0$ and $b_1$ produce odd periods, so does $b_0 b_1 \% M$.

• If $b^{\mathrm{Period}[f_b]/2} \% M = -1$, then the same holds for $b^{2k+1}$ for $k > 0$.

Proof. Assume that $p$ is the smallest positive number to satisfy $b^p \% M = 1$ and $p_k$ is the smallest positive number to satisfy $(b^k)^{p_k} \% M = b^{k p_k} \% M = 1$. Then $k p_k$ must be a multiple of $p$, or else $k p_k \% p < p$ would be the period of $b$ (since we can factor out multiples of $p$ at will). The smallest positive multiple of $p$ of this form is $pk/\gcd(p, k)$. Therefore, $p/\gcd(p, k)$ is the period of $b^k$.

As for the second case, consider the smallest positive values $p_0$ and $p_1$ to satisfy $b_0^{p_0} \% M = 1$ and $b_1^{p_1} \% M = 1$, and also the smallest positive value $p$ to satisfy $(b_0 b_1)^p \% M = 1$. Since $p^* = p_0 p_1/\gcd(p_0, p_1)$ is a multiple of both $p_0$ and $p_1$, it must satisfy the latter equation. Therefore, $p^* = pm$ for some integer $m > 0$ (or else $p^* \% p < p$ would satisfy the equation, since we can factor $p$ out). If $p_0$ and $p_1$ are odd, then so is $p^*$, and thus $p$ cannot be even. Substituting $\mathrm{Period}[f_{b^{2k+1}}] = \mathrm{Period}[f_b]/\gcd(\mathrm{Period}[f_b], 2k+1)$ into $b^{(2k+1)\,\mathrm{Period}[f_{b^{2k+1}}]/2} \% M$ yields the equation $(-1)^{(2k+1)/\gcd(\mathrm{Period}[f_b],\, 2k+1)} = -1$, which proves the third case. ∎
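The three claims of the theorem can be checked numerically for small moduli. This is a brute-force sketch, not part of the proof; the helper names `period` and `check_theorem` and the bound `max_k` are assumptions made for illustration.

```python
from math import gcd

def period(b, M):
    """Smallest r > 0 with b^r % M == 1 (requires gcd(b, M) == 1)."""
    r, value = 1, b % M
    while value != 1:
        value = value * b % M
        r += 1
    return r

def check_theorem(M, max_k=5):
    admissible = [b for b in range(2, M) if gcd(b, M) == 1]
    for b in admissible:
        p = period(b, M)
        # Claim 1: Period[f_b] = gcd(Period[f_b], k) * Period[f_{b^k}].
        for k in range(1, max_k + 1):
            if p != gcd(p, k) * period(pow(b, k, M), M):
                return False
        # Claim 3: b^(p/2) = -1 carries over to odd powers of b.
        if p % 2 == 0 and pow(b, p // 2, M) == M - 1:
            for k in range(1, max_k + 1):
                c = pow(b, 2 * k + 1, M)
                pc = period(c, M)
                if pc % 2 != 0 or pow(c, pc // 2, M) != M - 1:
                    return False
    # Claim 2: a product of two odd-period bases has an odd period.
    odd = [b for b in admissible if period(b, M) % 2 == 1]
    return all(period(b0 * b1 % M, M) % 2 == 1 for b0 in odd for b1 in odd)

print([check_theorem(M) for M in (15, 21, 55, 87)])
```

The product $b_0 b_1 \% M$ may equal 1, which is not admissible but has period 1 and therefore still satisfies the odd-period claim.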
Theorem 7.1 suggests using odd powers of primes for $b$. Straightforward computational experiments show that small primes have a much greater probability of success than 50%. Assuming that success for different primes is not strongly correlated, trying only $b = 2$, $b = 3$ and $b = 5$ can be expected to work in a majority of the cases. To illustrate this, Table 7 reports the percentage of semiprime $M = p \cdot q$ values with p, q < 2^^ where the resulting function $f_b(x) = b^x \% M$ for $b = 2, 3, 5$, their products $b = 6, 10, 15$, and their squares $b = 4, 9, 25$ has an even period $r$ where $b^{r/2} \% M \neq -1$. In this table we exclude easy values of $M$ with small $p$ and $q$ factors. Percentage statistics for unrestricted $p$, $q$ factors are very similar. The rows 2|3|5, 6|10|15, and 4|9|25 show the percentage of semiprime $M$ values for which at least one of $b = 2, 3,$ or $5$, $b = 6, 10,$ or $15$, and $b = 4, 9,$ or $25$ produces a useful period. The rows #Failed show the total number of $M$ values considered for each $n$ that do not yield a useful period with $b = 2, 3, 5$, $b = 6, 10, 15$, and $b = 4, 9, 25$. Adding primes < 43 as a base for the first set and < 61 for the second and third sets is sufficient to ensure that a useful period can be observed in all cases. The last row (#Total) shows the total number of $M$ for each $n$. In addition to the results reported in Table 7, we discovered that choosing larger primes than $b = 2, 3, 5$ as bases leads to more failed $M$ values. Therefore, the smallest bases are the most promising and can be tried first.
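The base-selection policy described above (try $b = 2$, then 3, then 5, and restart on an odd period or on $b^{r/2} \% M = -1$) can be sketched classically for small $M$. Brute-force period finding stands in for the quantum step here, and the names `period`, `useful` and `try_bases` are illustrative assumptions.

```python
from math import gcd

def period(b, M):
    """Smallest r > 0 with b^r % M == 1 (requires gcd(b, M) == 1)."""
    r, value = 1, b % M
    while value != 1:
        value = value * b % M
        r += 1
    return r

def useful(b, M):
    """True if the period r is even and b^(r/2) % M differs from -1."""
    r = period(b, M)
    return r % 2 == 0 and pow(b, r // 2, M) != M - 1

def try_bases(M, bases=(2, 3, 5)):
    """Return (base, factor, cofactor) for the first base that works, else None."""
    for b in bases:
        g = gcd(b, M)
        if g > 1:                            # the base already shares a factor with M
            return b, g, M // g
        if useful(b, M):
            r = period(b, M)
            p = gcd(pow(b, r // 2, M) - 1, M)
            return b, p, M // p
    return None

print(try_bases(55), try_bases(21))
restarts_21 = [b for b in range(2, 21) if gcd(b, 21) == 1 and not useful(b, 21)]
print(restarts_21)
```

For $M = 21$ the bases that fail this test are 4, 5, 16, 17 and 20, which matches the restart-triggering values listed for Figure 14.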

Selecting the number of controls. If we find that $b^k \% M = b^m \% M$ for some $k \neq m$, that allows us to upper-bound the period and then find it by binary search. When factoring large integers $M$ using Shor's algorithm, we can pursue different strategies for selecting the number of control qubits. Most of the literature shows that selecting twice as many qubits as bits in $M$ is sufficient. However, fewer bits suffice in many cases. Given that physical implementations of Shor's algorithm are typically limited by the number of qubits, a more practical strategy is to start with a small number of qubits, perform number-factoring, and increase the number of qubits in case of failure. This adds at most a poly-time factor to runtime complexity, but also reduces circuit sizes. Assuming that modular exponentiation circuits generally have size on the order of $n^3$, the difference in sizes $n^3 - (n-1)^3 = 3n^2 - 3n + 1$ is on the order of $n^2$, which can be significant.

8 Examples of modular exponentiation 

Our first series of experiments illustrates the proposed construction of mod-exp circuits but uses only multiplicative circuit decompositions for individual mod-mult blocks. Multiplicative decompositions do not require an ancilla register used by two-register circuits, but in some cases generate larger circuits, and for larger $M$ values may not be able to generate some $Cx \% M$ circuits at all. Therefore, this approach is more relevant for small $M$ values and an environment with a very limited number of qubits. Gate counts and the structure of the proposed circuits for $b^x \% M$ for $M$ with 9 functional qubits or fewer are reported in Table 8. In this table, the notation $x(y)$ represents modular multiplication by $x$ controlled on the line $y$. Each circuit is described by a parenthesized triplet consisting of the Toffoli gate count, the CNOT gate count and the number of ancillae. For each $M$, we initially selected $b = 2$ in modular exponentiation. If this triggered a restart in Shor's algorithm, we tried $b = 3$, and if that failed we selected $b = 5$. For each $M$ value, we calculated all parameters $C_i = b^{2^i} \% M$ and found the least costly multiplicative decomposition for each $C_i$ according to Table 1. The four costliest modular multiplications were selected for each $M$ value, and a (4, n)-LUT was synthesized for these multiplications using our systematic synthesis procedure. The remaining controlled modular multiplications were implemented directly and connected through 2-to-2 multiplexers. The number of Toffoli and CNOT gates required for each mod-mult sub-circuit is reported in Table 1. Each controlled SWAP in a 2-to-2 multiplexer can be implemented by one Toffoli and one CNOT gate (Figure 17). For the first and last controlled SWAP gates, $n = \lceil \log_2 M \rceil$ Toffoli and the same number of CNOT gates are applied. For intermediate controlled SWAPs, two additional CNOT gates are needed. Gate counts for $Cx \% M$ modules used in circuits of Table 8 are computed by adding up gate counts from Table 1. To simplify circuits for 4-LUTs, we applied the rule-based optimization method, which optimizes sub-circuits with common-target gates and uses both negative- and positive-control Toffoli gates during the optimization. For each $M$ in Table 8, another $b$ value may admit a smaller circuit, but finding the best $b$ (for a given large $M$) that is useful in number-factoring $M$ is, in general, no easier than number-factoring.

Fortunately, primality testing is in P and can be performed very efficiently in practice.

Table 7: The percentage of n-bit semiprimes $M = p \cdot q$ with p, q < 2^^ and $|\lceil \log_2 p \rceil - \lceil \log_2 q \rceil| < 2$ for which $b = 2, 3, 5$, their products $b = 6, 10, 15$, and their squares $b = 4, 9, 25$ result in $f_b(x) = b^x \% M$ having an even period $r$ where $b^{r/2} \% M \neq -1$. The rows 2|3|5, 6|10|15, and 4|9|25 show the percentage of semiprimes $M$ where at least one of $b = 2, 3,$ or $5$, $b = 6, 10,$ or $15$, and $b = 4, 9,$ or $25$ produces a useful period.
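The multipliers of the conditional multiplications follow from repeated squaring of the base. The sketch below assumes $C_i = b^{2^i} \% M$ with the control line numbered $i + 1$, and uses $M = 395$, $b = 2$ as an example; the function name `multipliers` is illustrative. Expressing a multiplier as a (possibly negated) inverse power of a small base, as done for some table entries, can be checked with Python's three-argument `pow`, which accepts negative exponents from Python 3.8 on.

```python
def multipliers(b, M, count):
    """C_i = b^(2^i) % M for i = 0..count-1, obtained by repeated squaring."""
    out, c = [], b % M
    for _ in range(count):
        out.append(c)
        c = c * c % M
    return out

print(multipliers(2, 395, 8))

# Multipliers written as inverse powers of 2 (modular inverse via pow).
print(pow(2, -4, 411))              # 2^(-4) modulo 411
print(-pow(2, -11, 393) % 393)      # -2^(-11) modulo 393
```

Multipliers that are small powers of 2, 3 or 5, or inverses of such powers, are the ones for which compact multiplicative decompositions are available, which is why they tend to be implemented directly rather than placed in the LUT.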

Our further experiments focused on scalable minimization of gate counts, but were allowed to use an additional $n$-qubit ancilla register to facilitate two-register mod-mult circuits. Figure 18 shows the distributions of mod-exp circuit sizes for $n = 7..14$. Each line represents the cumulative distribution function for the T-cost of mod-exp circuits constructed for all $n$-bit semiprime $M$ not divisible by 2 or 3. We note that for a given $n$ the median cost is about 2/3 of the maximal cost, but the smallest cost is only a fraction of the median cost. Table 9 reports min, max and average costs numerically, as well as the $M$ values for which extreme circuit costs were observed. The data for average and max costs are amenable to polynomial extrapolation ($R^2 > 0.999$), allowing us to estimate achievable circuit costs for much greater values of $n$ without necessarily having practical synthesis algorithms. However, the costs of the smallest circuits seen are too erratic for reliable extrapolation. Notably, our experiments optimize the number of control qubits, typically assumed to be $2n$. For each $M$, we use the smallest number that does not lead to failures in Shor's algorithm and report it in Table 9 along with the period found by Shor's algorithm.

For each semiprime $M$, there are two non-trivial $b$ values such that $b^2 \% M = 1$. While these bases lead to the most compact mod-exp circuits, finding them is as hard as number-factoring. To this end, the data in Table 8 suggest that bases $b = 2, 3, 5$ sometimes lead to unusually small circuits and short periods.

Figure 18: Costs of modular exponentiation circuits for $n$-bit $M$ values, $8 \le n \le 14$, shown as cumulative distribution functions (horizontal axis: circuit costs). The default base of exponentiation $b = 2$ was replaced by 3 or 5 as needed.

Compared to the mod-exp circuits in [20] that use $100n$ ancillae, our circuits use only $5n$ to $6n$ ancillae and are several orders of magnitude smaller in terms of gate counts. Circuit depth seems comparable for $n = 14$. However, considering circuit depth as a measure of circuit speed assumes that any number of gates can be implemented in parallel, which does not hold for many existing physical implementations. In an environment with a limited supply of qubits and limited parallelism, our circuits appear far superior to those proposed earlier. Whether or not many gates can be applied in parallel, larger circuits may require heavier quantum error correction, and this trend favors circuits with fewer gates.

9 Comparison with prior art 

Prior work on circuits for modular multiplication and modular exponentiation typically describes circuit sizes by a closed-form expression in terms of the number of input qubits. Those circuits typically take on the order of $n^2$ gates for modular multiplication and $n^3$ for modular exponentiation. The best cases almost always exhibit the same asymptotic growth. In contrast, our circuits for modular multiplication by 2, 3 and 5 (as well as their inverses) require only a linear number of gates. In the more general case, our optimization is algorithmic in nature; therefore, a closed-form expression cannot be given a priori, and comparisons require software implementations of our proposed algorithms. To compare the asymptotic number of gates in the proposed mod-mult and mod-exp circuits, we use the trend lines for maximum and average gate counts.

9.1 Modular multiplication 

In [5], circuits for $n$-qubit modular multiplication use $n$ conditional additions mod $M$. The addition mod $M$ is constructed from a multiplexed adder and a comparison operator, where the former is based on multiplexed full and half adders. Considering one enable bit in [5, Formulae 5.12 & 5.17] for multiplexed full and half adders leads to [2, 2, 2, 1] and [2, 2, 1, 0] gates in the worst case. Hence, worst-case gate counts for mod-mult in [5] are given by Formula (13) [5, Formula 6.4], leading to $[16n^2 - 16n,\ 8n^2 + 16n - 18,\ 24n^2 - 56n + 24,\ 4n^2 - 8n + 4]$ gates.

$4(n-1)^2[2, 2, 2, 1] + 4(n-1)[2, 2, 1, 0] + 8(n-1)[n, 2, 2n-3, 0] + 4(n-1)[0, 0, 1, 0] + 2(n-1)[0, 1, 0, 0] + 2[0, n, 0, 0] + 2[0, 2n, 0, 0]$ (13)

Following [5, Formula 6.4] leads to Formula (14) for the average gate count in mod-mult. As in the worst case, $2n + 1$ ancillae are used and cleared at the end of the computation.

$4(n-1)^2[1/2, 3/2, 3/2, 1/2] + 4(n-1)[1/2, 5/4, 1/2, 0] + 8(n-1)[n - 1/2, 3/2, 3/2\,n - 5/2, 0] + 4(n-1)[0, 0, 1, 0] + 2(n-1)[0, 1, 0, 0] + 2[0, 1/2\,n, 0, 0] + 2[0, n, 0, 0]$ (14)



QFT-based circuits exhibit slower asymptotic growth, but are viewed impractical for < 1000 qubits or less. 
Following [5], [co, ci, C2, C3] indicates a circuit with co NOT, ci CNOT, C2 C^NOT, and C3 C^NOT gates. 
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Tabic 8: The structure and gate count of circuits for }f%M for M with 9 bits or less. The notation 
x{y) represents modular multiplication by x controlled on the line y. Each circuit is described by 
a parenthesized triplet consisting of the T gate count, the C gate count and the number of ancillae. 
Circuits for M = 15 and M = 21 are illustrated in Figure |3] and Figure [T4l respectively. All ancillae are 



cleared. Gray cells indicate circuits without ancillae sharing. M values with 6^2 are boldfaced. 
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(36,40,9) 


(372,339,10) 


393 | 5 | 130 | 376(4) | 289(5) | 205(6) | 367(7) | 5(1), 25(2), 232(3), 283(8); 232 = -2^{-11}, 283 = 5^{-2} | (61,20,1) | (1936,1223,13) | (45,51,9) | (2042,1294,22)
395 | 2 | 156 | 361(5) | 366(6) | 51(7) | 231(8) | 2(1), 4(2), 16(3), 256(4) | (63,14,1) | (570,615,10) | (45,51,9) | (678,680,10)
403 | 2 | 60 | 16(3) | 256(4) | 250(5) | 35(6) | 2(1), 4(2) | (72,9,1) | (114,123,10) | (27,29,9) | (213,161,10)
407 | 2 | 180 | 9(5) | 81(6) | 49(7) | 366(8) | 2(1), 4(2), 16(3), 256(4) | (52,10,1) | (570,615,10) | (45,51,9) | (667,676,10)
411 | 2 | 68 | 16(3) | 256(4) | 187(5) | 34(6) | 2(1), 4(2), 334(7); 334 = 2^{-4} | (64,9,1) | (266,287,10) | (36,40,9) | (366,336,10)
413 | 2 | 174 | 282(5) | 228(6) | 359(7) | 25(8) | 2(1), 4(2), 16(3), 256(4) | (71,11,1) | (570,615,10) | (45,51,9) | (686,677,10)
415 | 2 | 164 | 381(5) | 326(6) | 36(7) | 51(8) | 2(1), 4(2), 16(3), 256(4) | (58,14,1) | (570,615,10) | (45,51,9) | (673,680,10)
417 | 5 | 138 | 25(2) | 259(6) | 361(7) | 217(8) | 5(1), 208(3), 313(4), 391(5); 208 = -2^{-1}, 313 = 2^{-2}, 391 = 2^{-4} | (66,16,1) | (584,471,13) | (45,51,9) |
427 | 2 | 60 | 16(3) | 256(4) | 205(5) | 179(6) | 2(1), 4(2) | (71,11,1) | (114,123,10) | (27,29,9) | (212,163,10)
437 | 2 | 198 | 423(5) | 196(6) | 397(7) | 289(8) | 2(1), 4(2), 16(3), 256(4) | (61,15,1) | (570,615,10) | (45,51,9) | (676,681,10)
445 | 2 | 44 | 16(3) | 256(4) | 121(5) | 401(6) | 2(1), 4(2) | (65,10,1) | (114,123,10) | (27,29,9) | (206,162,10)
447 | 2 | 148 | 274(5) | 427(6) | 400(7) | 421(8) | 2(1), 4(2), 16(3), 256(4) | (60,14,1) | (570,615,10) | (45,51,9) | (675,680,10)
451 | 2 | 20 | 4(2) | 16(3) | 256(4) | 141(5) | 2(1) | (68,9,1) | (38,41,10) | (18,18,9) | (124,68,10)
453 | 2 | 30 | 4(2) | 16(3) | 256(4) | 304(5) | 2(1) | (63,12,1) | (38,41,10) | (18,18,9) | (119,71,10)
469 | 2 | 66 | 16(3) | 256(4) | 345(5) | 368(6) | 2(1), 4(2), 352(7); 352 = 2^{-2} | (58,16,1) | (190,205,10) | (36,40,9) | (284,261,10)
471 | 2 | 52 | 16(3) | 256(4) | 67(5) | 250(6) | 2(1), 4(2) | (82,8,1) | (114,123,10) | (27,29,9) | (223,160,10)
473 | 3 | 210 | 81(3) | 185(6) | 169(7) | 181(8) | 3(1), 9(2), 412(4), 410(5); 412 = -2^{-3} · 3 · 5, 410 = 3^{-1} · 5^{-1} | (69,18,1) | (2042,1193,12) | (45,51,9) | (2156,1262,21)
481 | 3 | 18 | 9(2) | 81(3) | 308(4) | 107(5) | 3(1) | (64,13,1) | (262,133,12) | (18,18,9) | (344,164,21)
485 | 2 | 48 | 16(3) | 256(4) | 61(5) | 326(6) | 2(1), 4(2) | (74,9,1) | (114,123,10) | (27,29,9) | (215,161,10)
493 | 2 | 56 | 16(3) | 256(4) | 460(5) | 103(6) | 2(1), 4(2) | (64,14,1) | (114,123,10) | (27,29,9) | (205,166,10)
497 | 3 | 210 | 81(3) | 121(6) | 228(7) | 296(8) | 3(1), 9(2), 100(4), 60(5); 100 = 5^{-1} · 3, 60 = 2^{11} | (61,15,1) | (1766,1130,13) | (45,51,9) |
501 | 2 | 166 | 406(5) | 7(6) | 49(7) | 397(8) | 2(1), 4(2), 16(3), 256(4) | (62,16,1) | (570,615,10) | (45,51,9) | (677,682,10)
511 | 3 | 12 | 3(1) | 9(2) | 81(3) | 429(4) | | (54,6,1) | | | (54,6,1)
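
The constants in these rows can be cross-checked numerically. The sketch below assumes, judging from the values themselves, that an entry written C(i) stands for b^(2^(i-1)) mod M, and that the side notes express a constant through modular inverses of small numbers. The helper name `table_constant` is illustrative only, and negative exponents in `pow` require Python 3.8 or later.

```python
def table_constant(b: int, i: int, M: int) -> int:
    """Assumed reading of a table entry C(i): b^(2^(i-1)) mod M."""
    return pow(b, 2 ** (i - 1), M)

# (M, b) -> a few entries copied from the rows above, written as {i: C}.
rows = {
    (393, 5): {1: 5, 2: 25, 3: 232, 4: 376},
    (395, 2): {4: 256, 5: 361, 6: 366},
    (511, 3): {1: 3, 2: 9, 3: 81, 4: 429},
}
for (M, b), entries in rows.items():
    for i, C in entries.items():
        assert table_constant(b, i, M) == C

# Side notes for M = 393: 232 = -2^(-11) and 283 = 5^(-2).
assert (-pow(2, -11, 393)) % 393 == 232
assert pow(5, -2, 393) == 283

# The third column is a period of b modulo M: b^period = 1 (mod M).
assert pow(5, 130, 393) == 1
assert pow(3, 12, 511) == 1
```

Under this reading, a constant such as 232 for M = 393 can be reached either by repeated squaring of b or through a short chain of halvings, which is what the side notes record.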
Table 9: T-costs for modular exponentiation circuits for n-bit M values not divisible by 2 and 3. All M
values divisible by 5 were factored with b = 2 or b = 3. Trend lines were extrapolated using b = 2 and
n = 9..14. Parameters ℓ and π represent the number of controls and the period, respectively. Max and
min ℓ may not correspond to max-cost and min-cost circuits for a given n. Numbers in [ ] are C in LUT.



Bits n | T-costs for b = 2: Min | Max | Avg | # of lines for b = 2: Min | Max | Avg | (M, b, Cost, ℓ, π) for extreme circuits: Min cost | Max cost | Min circuit costs (4-LUT, Mod-mult, Mux) [C in LUT]
7 | 36 | 150 | 97.7 | 3 | 6 | 4.8 | (85,2,36,3,8) | (115,2,150,6,44) | (36,0,0) [2, 4, 16]
8 | 99 | 326 | 192.4 | 5 | 7 | 5.9 | (217,5,39,3,6) | (209,3,655,7,90) | (39,0,0) [5, 25, 191]
9 | 61 | 631 | 375.9 | 4 | 8 | 6.8 | (511,3,54,4,12) | (497,3,1191,8,210) | (54,0,0) [3, 9, 81, 429]
10 | 121 | 1099 | 689.4 | 5 | 9 | 7.8 | (635,2,121,5,28) | (713,3,1747,9,330) | (58,43,20) [4, 16, 256, 131]
11 | 68 | 1691 | 992.3 | 4 | 10 | 8.3 | (1285,2,68,4,16) | (1841,3,2584,10,786) | (68,0,0) [2, 4, 16, 256]
12 | 146 | 2511 | 1624.5 | 5 | 11 | 9.4 | (4069,5,46,3,8) | (3817,5,3601,11,1730) | (46,0,0) [5, 25, 625]
13 | 75 | 3463 | 2332.6 | 4 | 12 | 10.3 | (5461,2,75,4,14) | (8153,3,4876,12,3930) | (75,0,0) [2, 4, 16, 256]
14 | 179 | 4680 | 3224.9 | 5 | 13 | 11.2 | (10261,2,179,5,30) | (14849,3,6282,13,7170) | (88,63,28) [4, 16, 256, 3970]

Trend lines for b = 2:
Max(n) = 3.861n^3 - 40.61n^2 + 187.3n - 578.9
Avg(n) = 1.979n^3 + 12.32n^2 - 512.9n + 2563.0

Extrapolated values for b = 2, n: (max, avg)
20: (17811, 13065); 50: (389886, 255093); 100: (3473051, 2053473); 200: (29300481, 16224783); 300: (100647711, 54390493)
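
The extrapolated values can be reproduced by evaluating the two trend lines directly. The sketch below assumes the trend lines are cubic polynomials in n with the coefficients printed above; with that reading, rounding to the nearest integer gives the tabulated (max, avg) pairs. The function names are illustrative.

```python
def max_t_cost(n: int) -> float:
    """Trend line for the maximum T-cost with b = 2 (assumed cubic in n)."""
    return 3.861 * n**3 - 40.61 * n**2 + 187.3 * n - 578.9

def avg_t_cost(n: int) -> float:
    """Trend line for the average T-cost with b = 2 (assumed cubic in n)."""
    return 1.979 * n**3 + 12.32 * n**2 - 512.9 * n + 2563.0

def extrapolate(n: int) -> tuple[int, int]:
    """Return the (max, avg) pair rounded to the nearest integer."""
    return round(max_t_cost(n)), round(avg_t_cost(n))

for n in (20, 50, 100, 200, 300):
    print(n, extrapolate(n))
```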

To account for the number of T and C gates, one can apply the cost model in [14], which leads to
8n^2 + 16n - 18 C and 36n^2 - 80n + 36 T gates in the worst case and 6n^2 - 16n + 13 C and 24n^2 - 50n + 26
T gates in the average case with 2n + 1 ancillae: (486 C, 1240 T), (622 C, 1700 T), (774 C, 2232 T), (942 C, 2836
T), (1126 C, 3512 T), (1326 C, 4260 T) for n = 7, 8, ..., 12 in the worst case and (195 C, 852 T), (269 C,
1162 T), (355 C, 1520 T), (453 C, 1926 T), (563 C, 2380 T), (685 C, 2882 T) for n = 7, 8, ..., 12 in the
average case. More recent work optimizes ancillae [T^ and circuit depth [5D], resulting in larger circuits.
The trend lines for T-cost in the proposed modular multiplication circuits are 5.309n^2 - 11.59n + 4.5
and 3.351n^2 + 7.127n - 78.57 in the worst and average cases, respectively (Table [5]).
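
The listed gate counts follow from the four quadratic expressions. A minimal sketch, with illustrative function names, that evaluates them for n = 7, ..., 12:

```python
def worst_case(n: int) -> tuple[int, int]:
    """Worst-case (C, T) gate counts under the cost model of [14]."""
    return 8 * n * n + 16 * n - 18, 36 * n * n - 80 * n + 36

def average_case(n: int) -> tuple[int, int]:
    """Average-case (C, T) gate counts under the same cost model."""
    return 6 * n * n - 16 * n + 13, 24 * n * n - 50 * n + 26

for n in range(7, 13):
    # Each line shows n, the worst-case pair, and the average-case pair.
    print(n, worst_case(n), average_case(n))
```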

9.2 Modular exponentiation 

In [5], n-qubit modular exponentiation is constructed from ≈ 2n conditional modular multiplications.
For a modular multiplication with an enable bit, n conditional mod M additions are chained. Hence,
each mod M addition has a pair of enable bits. The average CNOT and Toffoli gate counts for n-qubit
modular exponentiation are 14n^3 + 5n^2 - 18n + 13 and 46n^3 - 107n^2 + 92n - 25, respectively [5]. In
this configuration, 2n + 3 ancillae are used and cleared at the end of the computation.

In [22], modular exponentiation is performed by setting Register 2 to |1⟩ and applying n conditional
mod-mult %M modules followed by a controlled multiplication network %M that clears the
ancillae: 7n + 1 ancillae in total. Overall, the algorithm needs 20n^2 - 5n adders with 4n - 3 C and 4n - 4
T gates each, which leads to 96n^3 - 84n^2 + 15n C and 80n^3 - 100n^2 + 20n T gates. The adder structure of
[22] was improved in [50] to include 2n - 3 C and 3n - 3 T gates, leading to 40n^3 - 70n^2 + 15n C and
60n^3 - 75n^2 + 15n T gates for modular exponentiation.
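
The totals quoted for the improved adder of [50] are the product of the adder count 20n^2 - 5n and the per-adder gate counts. A small sketch that multiplies the polynomials as coefficient lists (lowest degree first); the helper `poly_mul` is an illustrative name, not part of any cited work:

```python
def poly_mul(p: list[int], q: list[int]) -> list[int]:
    """Multiply two polynomials given as coefficient lists, lowest degree first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

adders = [0, -5, 20]     # 20n^2 - 5n adders in total
c_per_adder = [-3, 2]    # 2n - 3 C gates per adder, as improved in [50]
t_per_adder = [-3, 3]    # 3n - 3 T gates per adder, as improved in [50]

print(poly_mul(adders, c_per_adder))  # coefficients of 40n^3 - 70n^2 + 15n
print(poly_mul(adders, t_per_adder))  # coefficients of 60n^3 - 75n^2 + 15n
```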

Modular multiplications in our proposed structure are unconditional (Section 7.3). To isolate the
effect of structural and algorithmic optimizations for modular exponentiation from the effect of the
ideas proposed for modular multiplication, here we use the same structure as in [5], except that
each mod M addition has a single enable bit whereas two enable bits were used in [5]. The average
numbers of CNOT and Toffoli gates in modular multiplication are 6n^2 - 16n + 13 and 24n^2 - 50n + 26,
respectively.

The first and last controlled SWAPs in a 2-to-2 multiplexer need n Toffoli and n CNOT gates. Other
controlled SWAP gates need two additional CNOTs. Finally, note that the gate count of a (4, n)-LUT is
O(n). Precisely, 1/2 × (8n) C^4NOT gates are required for a (4, n)-LUT on average with one ancilla. With
two zero-initialized ancillae, each C^4NOT gate can be decomposed into 5 Toffoli gates. Overall, the
average number of Toffoli gates is 20n.

Combining the above calculations with our proposed structure for modular exponentiation shown in
Formula (15) leads to 6n^3 - 39n^2 + 76n - 62 CNOT and 24n^3 - 145n^2 + 243n - 104 Toffoli gates. As for
the number of ancillae, aside from 2n + 1 for mod-mult, we need 2n ancillae for Register 1 and Register 2
in Shor's algorithm, and n ancillae for the swap register: 5n + 1 in total. Applying the proposed mod-mult
circuits in mod-exp instead of [5] reduces the leading orders in C and T gates from 6n^3 and 24n^3 to
5.309n^3 and 3.351n^3, respectively.

Footnote: This cost model evaluates a circuit implementation by estimating the number of two-qubit gates required to implement
it. Inverters are ignored because they may be merged into 2-qubit gates. In this generic model, an n-qubit Toffoli gate
(with either positive or negative controls) can be decomposed into 2n - 5 T gates.
20n (0 C, 1 T) + (n - 4)(6n^2 - 16n + 13 C, 24n^2 - 50n + 26 T) + 2(n C, n T) + (n - 5)(n + 2 C, n T)    (15)
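
Formula (15) can be summed term by term and compared with the closed-form totals 6n^3 - 39n^2 + 76n - 62 CNOT and 24n^3 - 145n^2 + 243n - 104 Toffoli gates. The sketch below is only an arithmetic check; the function and variable names are illustrative, and the labels on the four terms follow the surrounding discussion of the LUT, the mod-mult blocks, and the controlled SWAPs.

```python
def mod_exp_counts(n: int) -> tuple[int, int]:
    """Sum the four terms of Formula (15) as (CNOT, Toffoli) pairs."""
    lut = (0, 20 * n)                                   # (4, n)-LUT
    mod_mult = ((n - 4) * (6 * n * n - 16 * n + 13),    # n - 4 mod-mult blocks
                (n - 4) * (24 * n * n - 50 * n + 26))
    outer_swaps = (2 * n, 2 * n)                        # first and last CSWAPs
    inner_swaps = ((n - 5) * (n + 2), (n - 5) * n)      # remaining CSWAPs
    terms = (lut, mod_mult, outer_swaps, inner_swaps)
    return sum(t[0] for t in terms), sum(t[1] for t in terms)

def closed_form(n: int) -> tuple[int, int]:
    """Closed-form totals stated in the text."""
    return (6 * n**3 - 39 * n**2 + 76 * n - 62,
            24 * n**3 - 145 * n**2 + 243 * n - 104)

def ancillae(n: int) -> int:
    """2n + 1 for mod-mult, 2n for Registers 1 and 2, n for the swap register."""
    return (2 * n + 1) + 2 * n + n

assert all(mod_exp_counts(n) == closed_form(n) for n in range(5, 300))
```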



In [20], depth-optimized circuits for modular exponentiation were constructed by parallelizing
modular multiplications and using depth-optimized adders. With arbitrary-distance interaction between
qubits, the authors reduced the asymptotic depth of modular exponentiation to O(n log^2 n). However,
their circuits need ≈ 100n ancillae, use a large number of gates, and assume unbounded gate parallelism,
which can make them impractical with current technologies. For n = 128, the latency (circuit depth)
of the best technique in [50, Algorithm E, Table II] is 1.96 x 10"* CNOT and 1.71 x 10^ Toffoli gates
with 12657 ancillae. For n = 128, our mod-exp circuits need 1.1 × 10^7 CNOT and 7.0 × 10^6 Toffoli gates
with 641 ancillae. If [20, Algorithm G, Table II] with 660 ancillae is used for comparison, the latency is
2.48 x 10^ C and 1.50 x 10^ T gates. Even though our circuits are not optimized for depth, the actual
number of gates seems comparable to the depth of depth-optimized circuits in [20].
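
The mantissas quoted for our circuits at n = 128 (1.1 and 7.0) are consistent with keeping only the leading terms of the trend lines and with the ancilla count 5n + 1 from Section 9.2. The sketch below assumes those leading terms are cubic, 5.309n^3 and 3.351n^3; the function name is illustrative.

```python
def leading_order_estimate(n: int) -> dict[str, float]:
    """Leading-order gate estimates, assuming cubic leading terms."""
    return {
        "cnot": 5.309 * n**3,
        "toffoli": 3.351 * n**3,
        "ancillae": 5 * n + 1,
    }

est = leading_order_estimate(128)
print(f"{est['cnot']:.1e} CNOT, {est['toffoli']:.1e} Toffoli, "
      f"{est['ancillae']} ancillae")
```

Under this assumption the estimates come to about 1.1 × 10^7 CNOT and 7.0 × 10^6 Toffoli gates with 641 ancillae.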

10 Conclusions and future research 

In this paper, we proposed linear-size circuits for several special cases of modular multiplication and
used them to develop a shortest-path formalism for finding compact generic mod-mult circuits. Our
results can be viewed as the first illustration of automated logic synthesis and optimization for modular
multiplication circuits with superior results compared to mathematical circuit constructions. Our circuits
are also the first not to require Bennett's technique, and this produces significant savings. The above
results are directly applicable to modular exponentiation circuits, for which we propose several additional
improvements.

Another first in our research is the use of register-transfer level (RTL) primitives to optimize reversible
circuits, where previous techniques for reversible logic optimization operate at the bit level [17]. This
higher-level perspective facilitates much greater scalability than previous algorithms offer. Additionally,
the RTL primitives we proposed in Table [5] are good candidates for direct implementation in terms
of specific quantum technologies. Such implementations may be faster and less error-prone than the
decompositions into elementary gates that we have shown. They can also support a higher level of
programming of quantum computers, where sequences of operators demonstrated in Tables [¥] and [5] can
be issued directly to the quantum computer without intermediate levels of software translation.

Despite concrete evidence of smaller circuits for mod-exp, our research leaves a number of open
challenges. In particular, the algorithms for synthesizing mod-exp circuits that we have implemented
do not currently scale beyond 15-bit M values. More scalable techniques must be developed, perhaps
by giving up the optimality of results. Departing from the register-level structure of our current mod-mult
circuits, bit-level local optimization [17] may reduce gate counts. Further reductions may be achievable
by leaving the Boolean domain [Uj.
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